Can anyone explain StandardScaler to me?
Intro
I assume that you have a matrix X where each row/line is a sample/observation and each column is a variable/feature (this is the expected input for any sklearn ML function, by the way: X.shape should be [number_of_samples, number_of_features]).
Core of method
The main idea is to normalize/standardize (i.e. μ = 0 and σ = 1) your features/variables/columns of X, individually, before applying any machine learning model.
StandardScaler() will normalize the features, i.e. each column of X, INDIVIDUALLY, so that each column/feature/variable will have μ = 0 and σ = 1.
P.S.: I find the most upvoted answer on this page wrong. I am quoting: "each value in the dataset will have the sample mean value subtracted". This is not correct.
See also: How and why to Standardize your data: A python tutorial
Example with code
from sklearn.preprocessing import StandardScaler
import numpy as np
# 4 samples/observations and 2 variables/features
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(data)
[[0 0]
 [1 0]
 [0 1]
 [1 1]]
print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]
Verify that the mean of each feature (column) is 0:
scaled_data.mean(axis = 0)
array([0., 0.])
Verify that the std of each feature (column) is 1:
scaled_data.std(axis = 0)
array([1., 1.])
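As a cross-check, the same result can be reproduced by hand with NumPy. This is only a minimal sketch on the same toy data; the names col_mean, col_std and manual are my own, and it assumes the population standard deviation (ddof=0), which is what StandardScaler uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same 4 samples x 2 features as in the example above
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Statistics are computed per column (axis=0), not over the whole matrix
col_mean = data.mean(axis=0)  # mean of each feature
col_std = data.std(axis=0)    # population std (ddof=0) of each feature

# Standardize each column with its own mean and std
manual = (data - col_mean) / col_std

# Compare the manual result with the scaler's output
print(np.allclose(manual, scaled_data))

# The fitted scaler stores the same per-column statistics
print(scaler.mean_, scaler.scale_)
```

The fitted attributes mean_ and scale_ hold one value per column, which is what "individually" means in practice.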
Appendix: The maths
UPDATE 08/2020: Concerning setting the input parameters with_mean and with_std to False/True, I have provided an answer here: StandardScaler difference between “with_std=False or True” and “with_mean=False or True”
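For a quick feel of what the two flags do, here is a small sketch on the same toy data as above. The variable names centered_only and scaled_only are just illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

# with_std=False: only subtract the per-column mean, do not rescale
centered_only = StandardScaler(with_std=False).fit_transform(data)

# with_mean=False: only divide by the per-column std, do not center
scaled_only = StandardScaler(with_mean=False).fit_transform(data)

print(centered_only)
print(scaled_only)
```

With with_std=False the columns end up with mean 0 but keep their original spread; with with_mean=False the columns end up with standard deviation 1 but are not centered at 0.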
The idea behind StandardScaler is that it will transform your data such that its distribution will have a mean value 0 and standard deviation of 1.
In the case of multivariate data, this is done feature-wise (in other words, independently for each column of the data).
Given the distribution of the data, each value in the dataset will have the mean value subtracted, and will then be divided by the standard deviation, with both statistics computed over the whole dataset (or over that value's feature in the multivariate case).
How to calculate it:
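The standard score is usually written as follows. The symbols here are my own choice: x_i is a value of one feature, N is the number of samples, and μ and σ are that feature's mean and population standard deviation:

```latex
z_i = \frac{x_i - \mu}{\sigma},
\qquad
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \mu \right)^2}
```

For example, for the feature values 0, 1, 0, 1 we get μ = 0.5 and σ = 0.5, so the standardized values are -1, 1, -1, 1.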
You can read more here:
- http://sebastianraschka.com/Articles/2014_about_feature_scaling.html#standardization-and-min-max-scaling
StandardScaler performs the task of standardization. Usually a dataset contains variables that are on different scales. For example, an Employee dataset may contain an AGE column with values in the range 20-70 and a SALARY column with values in the range 10000-80000.
As these two columns are on different scales, they are standardized to a common scale when building a machine learning model.
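A rough sketch of that situation, using made-up numbers (the employees array below is purely hypothetical data, with AGE in the first column and SALARY in the second):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical employee data: column 0 is AGE, column 1 is SALARY
employees = np.array([
    [25, 20000],
    [35, 40000],
    [45, 60000],
    [55, 80000],
], dtype=float)

# Each column is standardized with its own mean and standard deviation
scaled = StandardScaler().fit_transform(employees)

print(scaled)
print(scaled.mean(axis=0))  # per-column mean after scaling
print(scaled.std(axis=0))   # per-column std after scaling
```

After scaling, both columns are expressed in units of "standard deviations from the column mean", so SALARY no longer dominates AGE just because its raw numbers are larger.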