How to create/customize your own scorer function in scikit-learn?
I am using Support Vector Regression as an estimator in GridSearchCV. But I want to change the error function: instead of using the default (R-squared: coefficient of determination), I would like to define my own custom error function.
I tried to make one with make_scorer, but it didn't work.
I read the documentation and found that it's possible to create custom estimators, but I don't need to remake the entire estimator - only the error/scoring function.
I think I can do it by defining a callable as a scorer, like it says in the docs.
But I don't know how to use an estimator: in my case SVR. Would I have to switch to a classifier (such as SVC)? And how would I use it?
My custom error function is as follows:
def my_custom_loss_func(X_train_scaled, Y_train_scaled):
    error, M = 0, 0
    for i in range(0, len(Y_train_scaled)):
        z = (Y_train_scaled[i] - M)
        if X_train_scaled[i] > M and Y_train_scaled[i] > M and (X_train_scaled[i] - Y_train_scaled[i]) > 0:
            error_i = (abs(Y_train_scaled[i] - X_train_scaled[i]))**(2*np.exp(z))
        if X_train_scaled[i] > M and Y_train_scaled[i] > M and (X_train_scaled[i] - Y_train_scaled[i]) < 0:
            error_i = -(abs((Y_train_scaled[i] - X_train_scaled[i]))**(2*np.exp(z)))
        if X_train_scaled[i] > M and Y_train_scaled[i] < M:
            error_i = -(abs(Y_train_scaled[i] - X_train_scaled[i]))**(2*np.exp(-z))
        error += error_i
    return error
The variable M isn't null/zero. I've just set it to zero for simplicity.
Would anyone be able to show an example application of this custom scoring function? Thanks for your help!
Solution 1:
Jamie has a fleshed out example, but here's an example using make_scorer straight from scikit-learn documentation:
import numpy as np
from sklearn.metrics import make_scorer
def my_custom_loss_func(ground_truth, predictions):
    diff = np.abs(ground_truth - predictions).max()
    return np.log(1 + diff)
# loss will negate the return value of my_custom_loss_func,
# which will be np.log(2), 0.693, given the values for ground_truth
# and predictions defined below.
loss = make_scorer(my_custom_loss_func, greater_is_better=False)
score = make_scorer(my_custom_loss_func, greater_is_better=True)
# Note: in the calls below, ground_truth is passed as the feature matrix X
# (one row per sample) and predictions as the target vector y.
ground_truth = [[1], [1]]
predictions = [0, 1]
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy='most_frequent', random_state=0)
clf = clf.fit(ground_truth, predictions)
loss(clf, ground_truth, predictions)   # -0.693...
score(clf, ground_truth, predictions)  # 0.693...
When defining a custom scorer via sklearn.metrics.make_scorer, the convention is that custom functions ending in _score return a value to maximize, while functions ending in _loss or _error return a value to minimize. You tell make_scorer which kind you have through its greater_is_better parameter: True for functions where higher values are better, False for functions where lower values are better. With greater_is_better=False the scorer negates the function's return value, so GridSearchCV, which always maximizes the score, optimizes in the appropriate direction.
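To make the sign convention concrete, here is a minimal sketch. The toy data, the DummyRegressor, and the name mean_abs_error are illustrative assumptions, not part of the question. It shows that the wrapped function is called as func(y_true, y_pred) and that greater_is_better=False simply flips the sign of its result.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def mean_abs_error(y_true, y_pred):
    # Hypothetical metric for illustration: lower is better.
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


# Toy data (assumed for this sketch only).
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# DummyRegressor always predicts the mean of the training targets.
reg = DummyRegressor(strategy="mean").fit(X, y)

as_loss = make_scorer(mean_abs_error, greater_is_better=False)
as_score = make_scorer(mean_abs_error, greater_is_better=True)

raw = mean_abs_error(y, reg.predict(X))
loss_value = as_loss(reg, X, y)    # equals -raw
score_value = as_score(reg, X, y)  # equals +raw
print(raw, loss_value, score_value)
```

Note that the estimator stays a regressor throughout; nothing here requires switching to a classifier.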
You can then convert your function into a scorer as follows:
import numpy as np
from sklearn.metrics import make_scorer
def custom_loss_func(X_train_scaled, Y_train_scaled):
    # make_scorer calls this as func(y_true, y_pred), so X_train_scaled
    # receives the true targets and Y_train_scaled the predictions.
    error, M = 0, 0
    for i in range(0, len(Y_train_scaled)):
        z = (Y_train_scaled[i] - M)
        # Assumption: a sample matching none of the branches contributes
        # nothing (the original left error_i undefined in that case).
        error_i = 0
        if X_train_scaled[i] > M and Y_train_scaled[i] > M and (X_train_scaled[i] - Y_train_scaled[i]) > 0:
            error_i = (abs(Y_train_scaled[i] - X_train_scaled[i]))**(2*np.exp(z))
        if X_train_scaled[i] > M and Y_train_scaled[i] > M and (X_train_scaled[i] - Y_train_scaled[i]) < 0:
            error_i = -(abs((Y_train_scaled[i] - X_train_scaled[i]))**(2*np.exp(z)))
        if X_train_scaled[i] > M and Y_train_scaled[i] < M:
            error_i = -(abs(Y_train_scaled[i] - X_train_scaled[i]))**(2*np.exp(-z))
        error += error_i
    return error
custom_scorer = make_scorer(custom_loss_func, greater_is_better=True)
Then pass custom_scorer into GridSearchCV as you would any other scoring function: clf = GridSearchCV(estimator, param_grid, scoring=custom_scorer), where estimator (for example SVR()) and param_grid are your own.
Solution 2:
As you saw, this is done by using make_scorer (docs).
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.svm import SVR
import numpy as np
rng = np.random.RandomState(1)
def my_custom_loss_func(X_train_scaled, Y_train_scaled):
    # The scorer calls this as func(y_true, y_pred).
    error, M = 0, 0
    for i in range(0, len(Y_train_scaled)):
        z = (Y_train_scaled[i] - M)
        # Assumption: a sample matching none of the branches contributes
        # nothing (the original left error_i undefined in that case).
        error_i = 0
        if X_train_scaled[i] > M and Y_train_scaled[i] > M and (X_train_scaled[i] - Y_train_scaled[i]) > 0:
            error_i = (abs(Y_train_scaled[i] - X_train_scaled[i]))**(2*np.exp(z))
        if X_train_scaled[i] > M and Y_train_scaled[i] > M and (X_train_scaled[i] - Y_train_scaled[i]) < 0:
            error_i = -(abs((Y_train_scaled[i] - X_train_scaled[i]))**(2*np.exp(z)))
        if X_train_scaled[i] > M and Y_train_scaled[i] < M:
            error_i = -(abs(Y_train_scaled[i] - X_train_scaled[i]))**(2*np.exp(-z))
        error += error_i
    return error
# Generate sample data
X = 5 * rng.rand(10000, 1)
y = np.sin(X).ravel()
# Add noise to targets (integer division so the size is an int on Python 3)
y[::5] += 3 * (0.5 - rng.rand(X.shape[0] // 5))
train_size = 100
my_scorer = make_scorer(my_custom_loss_func, greater_is_better=True)
svr = GridSearchCV(SVR(kernel='rbf', gamma=0.1),
                   scoring=my_scorer,
                   cv=5,
                   param_grid={"C": [1e0, 1e1, 1e2, 1e3],
                               "gamma": np.logspace(-2, 2, 5)})
svr.fit(X[:train_size], y[:train_size])
print(svr.best_params_)
print(svr.score(X[train_size:], y[train_size:]))
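The question also mentions defining a plain callable as the scorer instead of going through make_scorer. A minimal sketch of that route follows, assuming a callable with the signature scorer(estimator, X, y) that returns a single number where higher is better. The name neg_max_abs_error, the toy data, and the small parameter grid are illustrative only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR


def neg_max_abs_error(estimator, X, y):
    # Hypothetical scorer: GridSearchCV maximizes the returned value,
    # so the error is negated by hand.
    y_pred = estimator.predict(X)
    return -float(np.max(np.abs(y - y_pred)))


# Toy regression data (assumed for this sketch only).
rng = np.random.RandomState(0)
X = 5 * rng.rand(200, 1)
y = np.sin(X).ravel()

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1.0, 10.0], "gamma": [0.1, 1.0]},
    scoring=neg_max_abs_error,  # a callable is accepted directly
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)  # mean of the negated errors, so it is <= 0
```

With this form you handle the sign yourself, whereas make_scorer does it for you through greater_is_better.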