CatBoostError: cat_features must be integer or string, real number values and NaN values should be converted to string

I have a dataset with 122 columns which looks like:

train.head()

SK_ID_CURR  TARGET  NAME_CONTRACT_TYPE  CODE_GENDER FLAG_OWN_CAR    FLAG_OWN_REALTY CNT_CHILDREN    AMT_INCOME_TOTAL    AMT_CREDIT  AMT_ANNUITY ... FLAG_DOCUMENT_18    FLAG_DOCUMENT_19    FLAG_DOCUMENT_20    FLAG_DOCUMENT_21    AMT_REQ_CREDIT_BUREAU_HOUR  AMT_REQ_CREDIT_BUREAU_DAY   AMT_REQ_CREDIT_BUREAU_WEEK  AMT_REQ_CREDIT_BUREAU_MON   AMT_REQ_CREDIT_BUREAU_QRT   AMT_REQ_CREDIT_BUREAU_YEAR
0   100002  1   Cash loans  M   N   Y   0   202500.0    406597.5    24700.5 ... 0   0   0   0   0   0   0   0   0   1
1   100003  0   Cash loans  F   N   N   0   270000.0    1293502.5   35698.5 ... 0   0   0   0   0   0   0   0   0   0
2   100004  0   Revolving loans M   Y   Y   0   67500.0 135000.0    6750.0  ... 0   0   0   0   0   0   0   0   0   0
3   100006  0   Cash loans  F   N   Y   0   135000.0    312682.5    29686.5 ... 0   0   0   0   255 255 255 255 65535   255
4   100007  0   Cash loans  M   N   Y   0   121500.0    

I've imputed all NaNs and now want to use CatBoost as follows:

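import numpy as np
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Get variables for a model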
x = train.drop(["TARGET"], axis=1)
y = train["TARGET"]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)

cat_features = np.where(x.dtypes != float)[0]

cat = CatBoostClassifier(one_hot_max_size=7, iterations=21, random_seed=42, use_best_model=True, eval_metric='Accuracy', loss_function='Logloss')

cat.fit(X_train, y_train, cat_features = cat_features, eval_set=(X_test, y_test))
pred = cat.predict(X_test)

pool = Pool(X_train, y_train, cat_features=cat_features)
cv_scores = cv(pool, cat.get_params(), fold_count=10, plot=True)
print('CV score: {:.5f}'.format(cv_scores['test-Accuracy-mean'].values[-1]))
print('The test accuracy is :{:.6f}'.format(accuracy_score(y_test, cat.predict(X_test))))

which raises:

CatBoostError: Invalid type for cat_feature[534,6]=118975.5 : cat_features must be integer or string, real number values and NaN values should be converted to string.

All NaNs have been imputed as mentioned (I checked), and the code selects as cat_features only the columns whose dtype is not float, so no real-valued column should end up there.

Would someone help me to solve the mystery, please?


You are trying to use a column with dtype float as a categorical column. To fix the error, convert it to an int:

train["a"] = train["a"].astype(int)

However, in your case 118975.5 doesn't look like a valid category, so you might want to double-check whether you really want to use that column as categorical.
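One way to double-check which columns end up in cat_features is to compare the dtype test from the question with a stricter one. The sketch below uses a made-up four-column frame (the column names are borrowed from the question, the values and dtypes are assumptions) and shows that `x.dtypes != float` only excludes float64, so a column stored as float32 would still be selected. Whether this is what happened in the question's data is not confirmed; it is just one possible explanation.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training frame. The float32 dtype of
# AMT_INCOME_TOTAL is an assumption made for illustration only.
x = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans"],
    "CNT_CHILDREN": np.array([0, 1], dtype="int64"),
    "AMT_INCOME_TOTAL": np.array([202500.0, 118975.5], dtype="float32"),
    "AMT_CREDIT": np.array([406597.5, 135000.0], dtype="float64"),
})

# The check from the question: `float` means float64 here, so only
# float64 columns are excluded.
naive = list(x.columns[np.where(x.dtypes != float)[0]])

# Stricter check: exclude every floating dtype (float16/32/64).
cat_features = [c for c in x.columns
                if not pd.api.types.is_float_dtype(x[c])]

print("dtype != float :", naive)
print("no float dtypes:", cat_features)
```

Printing `x.dtypes` for the real data is the quickest way to see whether any column has a floating dtype other than float64.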

Here is a small example that reproduces the error and shows the fix:

from catboost import CatBoostRegressor
import numpy as np
import pandas as pd

train_data = [[1, 4],
              [4.0, 5]]

train = pd.DataFrame(train_data, columns=["a", "b"])

# train["a"] = train["a"].astype(int)  # Uncommenting this line fixes the "Invalid type for cat_feature" error

train_labels = [10, 20]
model = CatBoostRegressor(iterations=2,
                          cat_features=["a"]
                          )
model.fit(train, train_labels)
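A slightly more defensive variant of the cast is sketched below. The helper name `float_to_int_category` is hypothetical, not part of CatBoost or pandas. It converts a float column to int only when every value is a whole number, and refuses otherwise, since a value like 118975.5 suggests the column is numeric rather than categorical.

```python
import pandas as pd

def float_to_int_category(series):
    """Cast a float Series to int if, and only if, all values are whole numbers."""
    if not pd.api.types.is_float_dtype(series):
        return series  # already int, string, etc.; nothing to do
    if series.isna().any():
        raise ValueError(f"{series.name}: impute NaN values before casting")
    if not (series % 1 == 0).all():
        raise ValueError(
            f"{series.name}: has fractional values, probably not categorical"
        )
    return series.astype(int)

# Same toy data as in the example above: column "a" is stored as float64.
train = pd.DataFrame([[1, 4], [4.0, 5]], columns=["a", "b"])
train["a"] = float_to_int_category(train["a"])
print(train.dtypes)
```

If a column with fractional values really is meant to be categorical, converting it with `.astype(str)` is the other route the error message itself suggests.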

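This isn't exactly a solution, but I figure that 'cat_feature[534,6]=118975.5' tells you where the problem is: the two numbers look like a row and a column position, so with zero-based indexing the offending value is in the 7th column.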
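To turn those two numbers into a column name, something like the following sketch could be used. It assumes the bracketed pair is a zero-based (row position, column position) into the data passed to fit, and that the message has the format quoted in the question; other CatBoost versions may word it differently. The helper name and the column list (taken from the head() output, minus TARGET) are illustrative assumptions.

```python
import re
import pandas as pd

def locate_cat_feature_error(message, frame):
    """Return (row position, column name) parsed from the error text, or None."""
    match = re.search(r"cat_feature\[(\d+),\s*(\d+)\]", message)
    if match is None:
        return None
    row_pos, col_pos = int(match.group(1)), int(match.group(2))
    return row_pos, frame.columns[col_pos]

# Assumed column order of X_train: the head() output with TARGET dropped.
X_train = pd.DataFrame(columns=[
    "SK_ID_CURR", "NAME_CONTRACT_TYPE", "CODE_GENDER", "FLAG_OWN_CAR",
    "FLAG_OWN_REALTY", "CNT_CHILDREN", "AMT_INCOME_TOTAL",
])

message = ("Invalid type for cat_feature[534,6]=118975.5 : cat_features must "
           "be integer or string, real number values and NaN values should "
           "be converted to string.")
location = locate_cat_feature_error(message, X_train)
print(location)
```

Note that the row number is a position within the shuffled training split, not the original DataFrame index, so `X_train.iloc[534]` rather than `train.loc[534]` would be the row to inspect.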

I'm facing a similar problem now.