Pipeline OrdinalEncoder ValueError Found unknown categories
Solution 1:
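Your problem is that the encoder has encountered a value in the test data that it did not see in the training data. This is expected and not a bug in your code. You just need to set the handle_unknown argument on your encoder (for OrdinalEncoder, that means handle_unknown='use_encoded_value' together with an unknown_value).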
You should fit encoders and scalers to the training data (but not the test data) and then use them to transform both training and test data. Thus, you must plan for the possibility of unexpected values in the test data.
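Here is a minimal sketch of that fit-on-train, transform-both pattern. The toy arrays and the choice of -1 as the placeholder code are my own assumptions for illustration, not taken from the question. The handle_unknown='use_encoded_value' and unknown_value parameters are available in scikit-learn 0.24 and later.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data (assumed for illustration): "extreme" never appears in training.
X_train = np.array([["low"], ["medium"], ["high"]])
X_test = np.array([["medium"], ["extreme"]])

# Map any category not seen during fit to the placeholder code -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then transform both splits.
encoder.fit(X_train)
train_encoded = encoder.transform(X_train)
test_encoded = encoder.transform(X_test)

print(encoder.categories_)  # categories learned from the training data
print(test_encoded)         # the unseen "extreme" row is encoded as -1
```

Pick an unknown_value that cannot collide with a real category code; -1 works because the learned codes start at 0.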
Solution 2:
I'm late to the game but I landed on this page so I thought I would reply anyway.
You said it in your comment: "diabetes dataset has too many values in many of the features for a given test/train split to both mirror all the values"
This error happens with encoders when the test set contains categories that were not seen during training.
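To see the failure in isolation, here is a small reproduction, assuming a made-up two-category training set rather than your actual data. With the default handle_unknown='error', transform raises a ValueError as soon as it meets a category it was not fitted on.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up data (assumed for illustration): "c" is only in the test split.
X_train = np.array([["a"], ["b"]])
X_test = np.array([["b"], ["c"]])

# Default behaviour: unknown categories raise an error at transform time.
encoder = OrdinalEncoder().fit(X_train)

try:
    encoder.transform(X_test)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```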
Solution 3:
I had the exact same problem. I just used OneHotEncoder(handle_unknown='ignore') instead of OneHotEncoder() and the issue was fixed.
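For reference, this is roughly what that looks like on a toy example (the colour values are assumed for illustration). With handle_unknown='ignore', a category that was not seen during fit is encoded as a row of all zeros instead of raising.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumed for illustration): "purple" is absent from training.
X_train = np.array([["red"], ["green"], ["blue"]])
X_test = np.array([["green"], ["purple"]])

# Unknown categories are silently encoded as all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# transform returns a sparse matrix by default; densify it for printing.
test_encoded = encoder.transform(X_test).toarray()

print(encoder.categories_)
print(test_encoded)
```

Keep in mind that the all-zero row makes an unknown category indistinguishable from "none of the known categories", which may or may not be what you want.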
Solution 4:
I don't think OrdinalEncoder is the correct choice in this situation. The diabetes dataset consists of continuous features, not categorical features. As stated in the documentation for OrdinalEncoder:
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features.
That being said, without additional output from the traceback or your setup, I can't definitively say why you are getting this error. I was able to successfully split and execute the above code using the data loaded with the load_diabetes function. My guess is that in your case the encoder was somehow never fitted on the category "17.0", but again I would not recommend the use of a categorical encoder in this case.
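As a rough sketch of what I mean, assuming your data is the one returned by load_diabetes: treat the features as numeric and scale them instead of encoding them. The LinearRegression model, the split size and the random_state here are placeholders I picked for illustration, not something from your post.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the bundled diabetes data; all ten features are numeric.
X, y = load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A scaler has no notion of "unknown categories", so unseen numeric
# values in the test split are transformed without error.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(model.score(X_test, y_test))
```

If some of your columns really are categorical, you can still encode only those columns and leave the continuous ones to the scaler.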