Compute R^2 Score for Lasso Regression Against a Gold-Standard Function in scikit-learn

I have regularized Lasso models in scikit-learn, fit on PolynomialFeatures of four degrees (1, 3, 7, 11) using my training data. I generated predictions for 100 evenly spaced points on the interval [0, 20] and stored the results in a numpy array. My task is to return the R^2 score for each of the Lasso models relative to a new 'gold standard' test set generated from the true underlying cubic polynomial without noise. The original data, on which the code below was trained, includes added noise. I have to build this new test set by evaluating the true noiseless underlying function t^3/20 - t^2 - t at each of 100 evenly spaced points on the interval [0, 20], and ultimately select the degree whose R^2 indicates the best fit to that function. Here is my code so far:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

degs = (1, 3, 7, 11)

las_r2 = []

preds = np.zeros((4, 100))

for i, deg in enumerate(degs):
    # expand the training data into polynomial features of the current degree
    poly = PolynomialFeatures(degree=deg)
    X_poly = poly.fit_transform(X_train)
    linlasso = Lasso(alpha=0.01, max_iter=10000).fit(X_poly, y_train)

    # predictions at 100 evenly spaced points on [0, 20]
    y_poly = linlasso.predict(poly.fit_transform(np.linspace(0, 20, 100).reshape(-1, 1)))
    preds[i, :] = y_poly

    # R^2 against the noisy test set (not yet the gold standard)
    X_test_poly = poly.fit_transform(X_test)
    las_r2.append(linlasso.score(X_test_poly, y_test))

answer = max(las_r2)
    

What I don't know is how to incorporate that "gold standard" function described in the paragraph above into my for-loop.
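For reference, here is how I assume the noiseless gold-standard values themselves would be computed (y_gold is just my placeholder name); what I can't see is where this should enter the scoring inside the loop:

#my assumption: evaluate t^3/20 - t^2 - t at the same 100 evenly spaced points
t = np.linspace(0, 20, 100)
y_gold = t**3 / 20 - t**2 - t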


Clearing the air

I chewed on this for a bit. Let me simplify the question; correct me if this is wrong.

Given a gold-standard function t^3/20 - t^2 - t and a set of noisy data that roughly follows it, how can I set up my modeling strategy to find the hyperparameters that best fit the gold standard? My metric for "best fit" is R^2.

In your case, the only hyperparameter is the polynomial degree, taken from degs = (1, 3, 7, 11).


Answer

So you have these inputs into your problem:

  • X: the input range (np.linspace(0,20,100))
  • a "gold standard" function, t^3/20 - t^2 - t
  • y_noisy: noisy values which approximate the gold standard
  • a range of polynomial degrees degs = (1, 3, 7, 11) you want to test

To me, the following strategy makes sense:

import numpy as np
#for calculating R^2
from sklearn.metrics import r2_score

#x range
X = np.linspace(0,20,100)
#noisy data
y_noisy = ("I'm not sure how you're getting this data. "
           "It could either be generated by applying random variation "
           "to y_gold, or it could be loaded")
#noiseless "gold standard" data: t^3/20 - t^2 - t
y_gold = [t**3 / 20 - t**2 - t for t in X]

#defining helper function
def generate_predictions(X, y_noisy, deg):
  """does all your modeling stuff, outputs predicted y values
  """
  ...
  return y_pred

#for storing R^2 results
results = {}

#get R^2 for each degree
for deg in [1,3,7,11]:
  y_pred = generate_predictions(X, y_noisy, deg)
  results[deg] = r2_score(y_gold, y_pred)

#prints the results for each degree
print(results)
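
The generate_predictions helper is left as a stub above, so here is a minimal sketch of what it could look like, assuming the same PolynomialFeatures + Lasso(alpha=0.01, max_iter=10000) setup from the question, and assuming the noisy data lives on the same X grid as the gold standard (if your training data is a separate X_train / y_train, fit on that instead and predict on X):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

def generate_predictions(X, y_noisy, deg):
  """fits a Lasso model on degree-`deg` polynomial features of the noisy
  data, then returns its predictions at the evaluation points X
  """
  X_col = np.asarray(X).reshape(-1, 1)
  poly = PolynomialFeatures(degree=deg)
  #fit the model on the noisy data
  X_poly = poly.fit_transform(X_col)
  model = Lasso(alpha=0.01, max_iter=10000).fit(X_poly, y_noisy)
  #predict at the same evaluation points
  return model.predict(poly.transform(X_col))

Once the results dictionary is filled in, max(results, key=results.get) gives the degree with the highest R^2 against y_gold.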
  

A note

In the real world, gold standards are few and far between, and there would be no strong reason to use ML models when you already have the gold standard. As a result, splitting into training and testing data doesn't add much in this application, since comparing directly to the gold standard is theoretically perfect. In a real-world setting, though, tuning hyperparameters by scoring against the very function you are trying to recover would look a lot like overfitting.
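
That said, if you want to keep the loop structure from the question, the minimal change is to score each model's predictions on the 100-point grid against the noiseless y_gold rather than the noisy test set. A sketch, assuming preds, degs, and y_gold as defined earlier:

import numpy as np
from sklearn.metrics import r2_score

#R^2 of each stored prediction row against the noiseless gold standard
gold_r2 = [r2_score(y_gold, preds[i, :]) for i in range(len(degs))]

#degree whose model best fits the gold standard
best_deg = degs[int(np.argmax(gold_r2))]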