Set up a KNN for a score
I am trying to fill the NaN values in the "score" column using KNN, based on the values in columns X_100g, Y_100g and Z_100g.
Here is my df:
Product_Name  brand  score  X_100g  Y_100g  Z_100g
PA            abc    a      40      45      na
PB            def    b      27      27      8
PC            ghi    na     78      na      56
PD            klm    c      na      29      29
PE            nop    b      57      3       76
PF            qrs    na     45      42      33
What I tried:
imputer = KNNImputer(n_neighbors=5)
dataknn = imputer.fit_transform(data.filter("score"))
It fails with the error: "ValueError: at least one array or dtype is required"
Any help solving this?
Thx!
I then changed my initial code to:
imputer = SimpleImputer(strategy = "most_frequent")
dataimputed = imputer.fit_transform(data.filter(["score"]))
This gives the following error: "ValueError: cannot reindex from a duplicate axis"
Mistake 1:
df.filter('score') returns an empty dataframe.
This is because pandas expects a list-like object as the 'items' parameter (i.e., a list of the names of the columns you want to select); refer to the docs. However, you are supplying a str.
Use df.filter(['score']), or just df[['score']], to extract the 'score' column as a dataframe. Note that df['score'] with single brackets returns a Series, which the imputers reject because they expect 2D input.
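To make the selection difference concrete, here is a small sketch that rebuilds the frame from the question (the 'na' entries are assumed to be real missing values) and compares the working ways of pulling out the 'score' column. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the question; 'na' is assumed to mean a missing value.
df = pd.DataFrame(
    {
        "Product_Name": ["PA", "PB", "PC", "PD", "PE", "PF"],
        "brand": ["abc", "def", "ghi", "klm", "nop", "qrs"],
        "score": ["a", "b", np.nan, "c", "b", np.nan],
        "X_100g": [40, 27, 78, np.nan, 57, 45],
        "Y_100g": [45, 27, np.nan, 29, 3, 42],
        "Z_100g": [np.nan, 8, 56, 29, 76, 33],
    }
)

score_frame = df.filter(["score"])  # list of names -> one-column DataFrame
score_frame2 = df[["score"]]        # double brackets -> one-column DataFrame
score_series = df["score"]          # single brackets -> Series (1D)

print(type(score_frame).__name__, score_frame.shape)
print(type(score_series).__name__, score_series.shape)
```

Either of the two DataFrame forms can be passed to an imputer's fit_transform; the Series form would need reshaping first.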
Mistake 2:
You are using KNNImputer with a categorical variable, which is not possible because it works only on numeric data.
Instead, use SimpleImputer with the 'most_frequent' or 'constant' strategy; both work with categorical data. (IterativeImputer, like KNNImputer, expects numeric input.)
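As a rough illustration of the SimpleImputer route, the sketch below fills the missing 'score' values with the most frequent category. It uses only the 'score' values from the question and assumes the 'na' entries are real NaN values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Only the 'score' column from the question; 'na' is assumed to be NaN.
df = pd.DataFrame(
    {"score": pd.Series(["a", "b", np.nan, "c", "b", np.nan], dtype=object)}
)

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform needs 2D input, hence df[["score"]]; the result is a
# 2D NumPy array, flattened here to write it back into the column.
df["score"] = imputer.fit_transform(df[["score"]]).ravel()

print(df["score"].tolist())
```

Here 'b' is the most frequent observed value, so both gaps are filled with 'b'. This ignores the X_100g, Y_100g and Z_100g columns entirely, which is the trade-off compared with a KNN-based approach.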
If you really wish to use KNNImputer, first encode the 'score' column, impute the null values and then convert back.