What is the difference between doing a regression with a dataframe and ndarray?

Solution 1:

So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.

train[['ENGINESIZE']] is a one-column dataframe (2d), while train['ENGINESIZE'] is a pandas Series (1d).
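A quick way to check the two indexing forms. This is only a sketch: the frame below is a made-up stand-in for the question's train, and the values and the CO2EMISSIONS column are my assumptions, not taken from the original data.

```python
import pandas as pd

# Hypothetical stand-in for `train`; values are invented for illustration
train = pd.DataFrame({"ENGINESIZE": [2.0, 2.4, 1.5, 3.5],
                      "CO2EMISSIONS": [196, 221, 136, 255]})

two_d = train[["ENGINESIZE"]]   # list of column names -> DataFrame
one_d = train["ENGINESIZE"]     # single column name   -> Series

print(type(two_d).__name__, two_d.shape)
print(type(one_d).__name__, one_d.shape)
```

The extra pair of brackets is what keeps the column dimension, which matters because sklearn estimators expect a 2d X.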

I believe the preferred syntax for getting an array from the dataframe is one of the following (the pandas docs now recommend to_numpy() over .values):

 train[['ENGINESIZE']].values          # or
 train[['ENGINESIZE']].to_numpy()

though

 np.asanyarray(train[['ENGINESIZE']])

is supposed to do the same thing.
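To see that the three spellings agree, here is a small check on a made-up one-column frame (the values are assumptions, only the column name comes from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column name matches the question
train = pd.DataFrame({"ENGINESIZE": [2.0, 2.4, 1.5, 3.5]})

a = train[["ENGINESIZE"]].values
b = train[["ENGINESIZE"]].to_numpy()
c = np.asanyarray(train[["ENGINESIZE"]])

# All three are plain ndarrays with the same shape and contents
print(type(a).__name__, a.shape, a.dtype)
print(np.array_equal(a, b) and np.array_equal(b, c))
```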

Digging down through the regr.fit code, I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).

So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in doing that either. Either way the fit is done with arrays derived from the dataframe.
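A minimal sketch of that comparison, assuming regr is a sklearn.linear_model.LinearRegression (the answer above doesn't say which estimator it is) and using invented data with an assumed CO2EMISSIONS target column:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for `train`; values are invented for illustration
train = pd.DataFrame({"ENGINESIZE": [2.0, 2.4, 1.5, 3.5, 3.0],
                      "CO2EMISSIONS": [196, 221, 136, 255, 244]})

X_df = train[["ENGINESIZE"]]
y_df = train[["CO2EMISSIONS"]]

# Fit once on the dataframes, once on arrays derived from them
regr_df = LinearRegression().fit(X_df, y_df)
regr_np = LinearRegression().fit(X_df.to_numpy(), y_df.to_numpy())

print(regr_df.coef_, regr_df.intercept_)
print(np.allclose(regr_df.coef_, regr_np.coef_),
      np.allclose(regr_df.intercept_, regr_np.intercept_))
```

The fitted coefficients should match. The one difference I'm aware of is that recent scikit-learn versions (1.0 and later) also record the column names in feature_names_in_ when fit is given a dataframe.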