Why does one hot encoding improve machine learning performance? [closed]

I have noticed that when One Hot encoding is used on a particular data set (a matrix) and used as training data for learning algorithms, it gives significantly better results with respect to prediction accuracy, compared to using the original matrix itself as training data. How does this performance increase happen?

Solution 1:

Many learning algorithms either learn a single weight per feature, or they use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain.

Suppose you have a dataset having only a single categorical feature "nationality", with values "UK", "French" and "US". Assume, without loss of generality, that these are encoded as 0, 1 and 2. You then have a weight w for this feature in a linear classifier, which will make some kind of decision based on the constraint w×x + b > 0, or equivalently w×x < b.

The problem now is that the weight w cannot encode a three-way choice. The three possible values of w×x are 0, w and 2×w. Either these three all lead to the same decision (they're all < b or ≥b) or "UK" and "French" lead to the same decision, or "French" and "US" give the same decision. There's no possibility for the model to learn that "UK" and "US" should be given the same label, with "French" the odd one out.

By one-hot encoding, you effectively blow up the feature space to three features, which will each get their own weights, so the decision function is now w[UK]x[UK] + w[FR]x[FR] + w[US]x[US] < b, where all the x's are booleans. In this space, such a linear function can express any sum/disjunction of the possibilities (e.g. "UK or US", which might be a predictor for someone speaking English).

Similarly, any learner based on standard distance metrics (such as k-nearest neighbors) between samples will get confused without one-hot encoding. With the naive encoding and Euclidean distance, the distance between French and US is 1. The distance between US and UK is 2. But with the one-hot encoding, the pairwise distances between [1, 0, 0], [0, 1, 0] and [0, 0, 1] are all equal to √2.

This is not true for all learning algorithms; decision trees and derived models such as random forests, if deep enough, can handle categorical variables without one-hot encoding.

Solution 2:

Regarding the increase of the features by doing one-hot-encoding one can use feature hashing. When you do hashing, you can specify the number of buckets to be much less than the number of the newly introduced features.

What reason is there to use null instead of undefined in JavaScript?

Finding out the name of the original repository you cloned from in Git

How to find reverse dependencies on npm package?

How to tell git to use the correct identity (name and email) for a given project?

Fill between two vertical lines in matplotlib [duplicate]

The iOS deployment target 'IPHONEOS_DEPLOYMENT_TARGET' is set to 8.0, in Flutter How can I change the minimum IOS Deploying Target

Multiple submit buttons on HTML form – designate one button as default [duplicate]

Javascript regex returning true.. then false.. then true.. etc [duplicate]

list_display - boolean icons for methods

How to change language settings in R

How do I reword the very first git commit message?

What are the key differences between Scala and Groovy? [closed]

Why does one hot encoding improve machine learning performance? [closed]

Solution 1:

Solution 2:

Related

Recent Posts