sklearn and large datasets

Solution 1:

I've used several scikit-learn classifiers with out-of-core capabilities to train linear models (Stochastic Gradient Descent, Perceptron, and Passive Aggressive) as well as Multinomial Naive Bayes on a Kaggle dataset of over 30 GB. All of these classifiers share the partial_fit method you mention, though some behave better than others.

You can find the methodology, the case study, and some good resources in this post: http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/
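As a rough illustration of the shared partial_fit interface, here is a minimal sketch that feeds the same stream of mini-batches to three of the estimators mentioned above. The make_batch helper and the synthetic count data are assumptions standing in for chunks read from disk; they are not taken from the linked post.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier, Perceptron
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# partial_fit needs the full set of labels up front, because a single
# batch is not guaranteed to contain every class.
classes = np.array([0, 1])


def make_batch(n_rows=200, n_features=20):
    """Hypothetical stand-in for loading one chunk of a large dataset."""
    # Non-negative counts, so MultinomialNB accepts them too.
    X = rng.poisson(lam=3.0, size=(n_rows, n_features)).astype(float)
    y = (X[:, 0] > X[:, 1]).astype(int)
    return X, y


models = {
    "sgd": SGDClassifier(random_state=0),
    "perceptron": Perceptron(random_state=0),
    "naive_bayes": MultinomialNB(),
}

# Each batch is seen once and then discarded, so memory use is bounded
# by the batch size rather than by the size of the whole dataset.
for _ in range(50):
    X_batch, y_batch = make_batch()
    for model in models.values():
        model.partial_fit(X_batch, y_batch, classes=classes)

# Score on a fresh batch to compare how the estimators behave.
X_test, y_test = make_batch(n_rows=1000)
scores = {name: model.score(X_test, y_test) for name, model in models.items()}
print(scores)
```

Comparing the scores on a held-out batch is one simple way to see which estimators "behave better" on a given problem.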

Solution 2:

I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or another online learning approach, then you're on track. One thing to be aware of is that your chunk size may influence your results, so it is worth trying a few values.
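A minimal sketch of chunked training, assuming the data lives in a CSV file with a column named label (a hypothetical layout). It uses the chunksize parameter of pandas.read_csv so that only one chunk is in memory at a time, and exposes the chunk size as a parameter so different values can be compared. The small generated file is only a stand-in for a real large dataset.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Build a small CSV as a stand-in for a file too large to load at once.
rng = np.random.default_rng(0)
n_rows = 5000
frame = pd.DataFrame(rng.normal(size=(n_rows, 4)), columns=["f0", "f1", "f2", "f3"])
frame["label"] = (frame["f0"] + frame["f1"] > 0).astype(int)
csv_path = os.path.join(tempfile.mkdtemp(), "big.csv")
frame.to_csv(csv_path, index=False)


def fit_in_chunks(path, chunk_size):
    """Train an SGDClassifier one chunk at a time; return model and chunk count."""
    model = SGDClassifier(random_state=0)
    n_chunks = 0
    # With chunksize set, read_csv yields DataFrames of at most chunk_size rows.
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        X = chunk.drop(columns="label").to_numpy()
        y = chunk["label"].to_numpy()
        model.partial_fit(X, y, classes=np.array([0, 1]))
        n_chunks += 1
    return model, n_chunks


model_small, chunks_small = fit_in_chunks(csv_path, chunk_size=100)
model_large, chunks_large = fit_in_chunks(csv_path, chunk_size=1000)
print(chunks_small, chunks_large)
```

Smaller chunks use less memory per step but mean more passes through the Python loop; evaluating each model on held-out rows is a reasonable way to choose between sizes.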

This question may be useful: "Working with big data in python and numpy, not enough ram, how to save partial results on disc?"

I agree that h5py is useful but you may wish to use tools that are already in your quiver.

Another thing you can do is randomly decide whether to keep each row of your CSV file, then save the result to a .npy file so it loads more quickly. That way you get a sample of your data that lets you start experimenting with any algorithm, and you can deal with the bigger-data issue along the way (or not at all: sometimes a sample with a good approach is good enough, depending on what you want).
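The row-sampling idea could look something like the sketch below. The function name sample_csv_to_npy, the keep_prob parameter, and the assumption that the file has one header row and purely numeric columns are all illustrative choices, not part of any library.

```python
import csv
import os
import random
import tempfile

import numpy as np


def sample_csv_to_npy(csv_path, npy_path, keep_prob, seed=0):
    """Keep each data row with probability keep_prob and save the result as .npy."""
    rng = random.Random(seed)
    kept = []
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row (assumes the file has one)
        for row in reader:
            # Rows are streamed one at a time, so only the sample is held in memory.
            if rng.random() < keep_prob:
                kept.append([float(value) for value in row])
    sample = np.array(kept, dtype=np.float64)
    np.save(npy_path, sample)
    return sample


# Small demonstration file standing in for a much larger CSV.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "data.csv")
with open(csv_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["a", "b", "c"])
    for i in range(10000):
        writer.writerow([i, i * 2, i % 7])

npy_path = os.path.join(workdir, "sample.npy")
sample = sample_csv_to_npy(csv_path, npy_path, keep_prob=0.1)

# Later sessions can reload the sample directly, without re-parsing the CSV.
reloaded = np.load(npy_path)
print(reloaded.shape)
```

Because each row is kept independently, the sample size is only approximately keep_prob times the number of rows, which is usually fine for exploratory work.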

Solution 3:

You may want to take a look at Dask or GraphLab Create:

  • http://dask.pydata.org/en/latest/

  • https://turi.com/products/create/

They are similar to pandas but work on large-scale data (using out-of-core dataframes). The problem with pandas is that all the data has to fit into memory.

Both frameworks can be used with scikit-learn. You can load 22 GB of data into a Dask dataframe or an SFrame, then use it with sklearn.