How to handle large sets of categorical data
I'm a beginner in machine learning. I have a large data set with lots of categorical data. The data is nominal. I want to apply algorithms like SVM and decision trees with Python and scikit-learn to find patterns in the data.
My problem is that I don't know how best to handle this kind of data. I've read a lot about one-hot encoding, but the examples are all quite simple, like a feature with three different colors. My data has around 30 different categorical features, and each of those features has around 200 different values. If I use simple one-hot encoding, the data frame gets really big and I can hardly run any algorithm on it because I run out of RAM.
So what's the best approach here? Should I use an SQL database for the encoded tables? How is this done in the "real" world?
Thanks in advance for your answers!
Sklearn does not handle categorical features natively in its decision trees and random forests - it requires them to be converted to numeric columns, typically via one-hot encoding. Realistically though, there is a slightly better alternative:
This is called binary encoding: each category is mapped to an integer, and that integer is then written out in binary digits, so a feature with ~200 values needs only about 8 columns instead of 200. It keeps every category distinguishable while staying far more compact than one-hot encoding, and it works much better than plain numerical (label) encoding for nominal columns.
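A minimal sketch of binary encoding, assuming the third-party `category_encoders` package is installed (`pip install category_encoders`); the column name and sample data are made up for illustration:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'yellow'],
})

# Each category is mapped to an integer, and that integer is split into
# binary digits, so N distinct values need ~log2(N) columns instead of
# the N columns that one-hot encoding would produce.
encoder = ce.BinaryEncoder(cols=['color'])
encoded = encoder.fit_transform(df[['color']])
print(encoded)
```

With 30 features of ~200 values each, this brings you from roughly 6000 one-hot columns down to around 240, which should fit in RAM comfortably.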
Another way to approach this problem is clipping. The idea of clipping is to keep only the largest categories, e.g. all categories that account for 5% or more of the values, and encode everything else as a single 'tail' category. This is another way to reduce dimensionality before encoding; a rough sketch follows below.
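A rough sketch of the clipping idea with plain pandas; the column name, sample data, and 5% threshold are placeholders you would adjust to your data:

```python
import pandas as pd

def clip_rare_categories(series: pd.Series, min_share: float = 0.05) -> pd.Series:
    # Keep categories that cover at least `min_share` of the rows,
    # replace everything else with a single 'tail' category.
    shares = series.value_counts(normalize=True)
    keep = shares[shares >= min_share].index
    return series.where(series.isin(keep), other='tail')

df = pd.DataFrame({'city': ['Berlin', 'Berlin', 'Paris', 'Oslo', 'Lima', 'Berlin']})
df['city'] = clip_rare_categories(df['city'])

# One-hot encode afterwards; far fewer columns remain.
dummies = pd.get_dummies(df['city'], prefix='city')
print(dummies)
```

You can combine this with binary encoding as well: clip the rare values first, then encode whatever categories are left.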