What is the difference between doing a regression with a dataframe and ndarray?
Solution 1:
So df is the loaded dataframe, cdf is another frame with selected columns, and train is a selection of rows. train[['ENGINESIZE']] is a one-column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
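To make the one-column dataframe versus Series distinction concrete, here is a small sketch. The frame below is a made-up stand-in for train (the values are invented for illustration, not taken from the question's data).

```python
import pandas as pd

# Hypothetical stand-in for the question's `train` frame; values are invented.
train = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 3.5],
    "CO2EMISSIONS": [196, 221, 244],
})

two_d = train[["ENGINESIZE"]]  # list of column names -> DataFrame, 2-D
one_d = train["ENGINESIZE"]    # single column name   -> Series, 1-D

print(type(two_d).__name__, two_d.shape)
print(type(one_d).__name__, one_d.shape)
```

The double brackets matter here because scikit-learn estimators expect a 2-D X, which the one-column dataframe provides and the Series does not.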
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though np.asanyarray(train[['ENGINESIZE']]) is supposed to do the same thing.
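A quick way to check that the three spellings agree is to compare their results directly. This is only a sketch on an invented frame (train here is a hypothetical stand-in, not the question's data).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train`; values are invented.
train = pd.DataFrame({"ENGINESIZE": [2.0, 2.4, 3.5]})

a = train[["ENGINESIZE"]].values
b = train[["ENGINESIZE"]].to_numpy()
c = np.asanyarray(train[["ENGINESIZE"]])

# All three should be plain 2-D numpy arrays holding the same values.
for arr in (a, b, c):
    print(type(arr).__name__, arr.shape, arr.dtype)

print(np.array_equal(a, b) and np.array_equal(b, c))
```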
Digging down through the regr.fit code, I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
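The conversion step can be tried on its own by calling check_array directly. This is a sketch on an invented frame with one integer and one float column; the exact internals of fit may differ between scikit-learn versions, so treat it as an illustration of the conversion rather than a trace of what fit does.

```python
import numpy as np
import pandas as pd
from sklearn.utils import check_array

# Hypothetical frame with mixed numeric dtypes (int64 and float64).
frame = pd.DataFrame({
    "CYLINDERS": [4, 4, 6],
    "ENGINESIZE": [2.0, 2.4, 3.5],
})

arr = check_array(frame)  # validates the input and returns a numpy array

print(type(arr).__name__, arr.shape, arr.dtype)
```

The two column dtypes end up as a single common dtype in the returned array, which is the kind of dataframe peculiarity the validation code has to handle.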
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in doing that either. Either way the fit is done with arrays derived from the dataframe.
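As a sanity check, the same model can be fitted both ways and the results compared. The sketch below assumes regr is a LinearRegression, as the regr.fit call suggests, and uses invented numbers in place of the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for `train`; values are invented.
train = pd.DataFrame({
    "ENGINESIZE": [1.5, 2.0, 2.4, 3.5, 4.0],
    "CO2EMISSIONS": [150.0, 196.0, 221.0, 244.0, 290.0],
})

# Fit once with dataframes, once with arrays pulled out of them.
regr_df = LinearRegression().fit(train[["ENGINESIZE"]], train[["CO2EMISSIONS"]])
regr_np = LinearRegression().fit(
    train[["ENGINESIZE"]].to_numpy(),
    train[["CO2EMISSIONS"]].to_numpy(),
)

same_coef = np.allclose(regr_df.coef_, regr_np.coef_)
same_intercept = np.allclose(regr_df.intercept_, regr_np.intercept_)
print(same_coef, same_intercept)
```

One practical difference, as far as I know: recent scikit-learn versions remember the column names when fitted on a dataframe and may warn if you later call predict with a bare array. The fitted coefficients are the same either way.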