CatBoostError: cat_features must be integer or string, real number values and NaN values should be converted to string
I have a dataset with 122 columns, which looks like this:
train.head()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 0 0 0 0 0 0 0 1
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 0 0 0 0 0 0 0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0 0 0 0 0 0 0 0 0 0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0 0 0 0 255 255 255 255 65535 255
4 100007 0 Cash loans M N Y 0 121500.0
I've imputed all NaNs and want to use CatBoost now, as follows:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier, Pool, cv

# Get variables for a model
x = train.drop(["TARGET"], axis=1)
y = train["TARGET"]

# Do train/test splitting
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Treat every column that is not float as categorical
cat_features = np.where(x.dtypes != float)[0]

cat = CatBoostClassifier(one_hot_max_size=7, iterations=21, random_seed=42,
                         use_best_model=True, eval_metric='Accuracy',
                         loss_function='Logloss')
cat.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))
pred = cat.predict(X_test)

pool = Pool(X_train, y_train, cat_features=cat_features)
cv_scores = cv(pool, cat.get_params(), fold_count=10, plot=True)
print('CV score: {:.5f}'.format(cv_scores['test-Accuracy-mean'].values[-1]))
print('The test accuracy is: {:.6f}'.format(accuracy_score(y_test, cat.predict(X_test))))
which raises:
CatBoostError: Invalid type for cat_feature[534,6]=118975.5 : cat_features must be integer or string, real number values and NaN values should be converted to string.
NaNs are all imputed as mentioned (I checked), and the code selects as cat_features only the columns that are not real numbers.
Would someone help me solve this mystery, please?
You are trying to use a column with dtype float as a categorical column. To fix the error, convert it to int:
train["a"] = train["a"].astype(int)
However, in your case 118975.5 doesn't look like a valid category, so you might want to double-check whether you really want to use that column as categorical.
Here is a small example that reproduces the error and the fix:
from catboost import CatBoostRegressor
import pandas as pd

train_data = [[1, 4],
              [4.0, 5]]
train = pd.DataFrame(train_data, columns=["a", "b"])
# train["a"] = train["a"].astype(int)  # Uncommenting this line fixes the "Invalid type for cat_feature" error
train_labels = [10, 20]

model = CatBoostRegressor(iterations=2,
                          cat_features=["a"])
model.fit(train, train_labels)
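In your case, a less drastic fix may be to pick the categorical columns by dtype object instead of "everything that is not float", so no numeric column ever ends up in cat_features. A minimal sketch, reusing x, X_train, y_train, X_test, y_test and cat from your code, and assuming the string columns (NAME_CONTRACT_TYPE, CODE_GENDER, ...) are the ones you actually mean to be categorical; whether the integer FLAG_* columns should also be treated as categorical is a separate modelling choice:
import numpy as np

# Select only object (string) columns as categorical features; numeric columns
# such as AMT_INCOME_TOTAL or AMT_ANNUITY stay regular numeric features.
cat_features = np.where(x.dtypes == object)[0]
print(x.columns[cat_features])  # sanity check: should list only the string columns

cat.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))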
This wasn't exactly a solution, but I figure that 'cat_feature[534,6]=118975.5' tells you that there is some problem with the 7th column (index 6). I'm facing a similar problem now.
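If it helps, you can map those two indices back to the DataFrame to see exactly which column and value CatBoost rejected. A minimal sketch, assuming the indices in cat_feature[534,6] are the row and column positions of the frame passed to fit (X_train in the question):
row, col = 534, 6                 # taken from the error message cat_feature[534,6]=118975.5
print(X_train.columns[col])       # name of the offending column
print(X_train.dtypes.iloc[col])   # its dtype -- a float dtype is what triggers the error
print(X_train.iloc[row, col])     # the rejected value itself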