Recovering features names of explained_variance_ratio_ in PCA with sklearn
Solution 1:
This information is included in the pca attribute components_. As described in the documentation, pca.components_ is an array of shape [n_components, n_features], so to see how the components are linearly related to the different features, you can do the following:
Note: each coefficient is the weight (loading) of a particular feature in a particular component.
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump the components' relations with the features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
PC-1 0.522372 -0.263355 0.581254 0.565611
PC-2 -0.372318 -0.925556 -0.021095 -0.065416
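If you also want to see how much variance each of these labeled components explains (the explained_variance_ratio_ attribute from the question title), a minimal follow-up sketch, reusing the pca object fitted above, is:
# label explained_variance_ratio_ with the same PC names as the loadings table
print(pd.Series(pca.explained_variance_ratio_, index=['PC-1', 'PC-2']))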
IMPORTANT: As a side comment, note that the sign of a principal component does not affect its interpretation, since the sign does not affect the variance contained in the component. Only the relative signs of the features forming a PCA dimension are important. In fact, if you run the PCA code again, you might get the PCA dimensions with the signs inverted. For an intuition about this, think about a vector and its negative in 3-D space - both represent essentially the same direction. Check this post for further reference.
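To make the sign remark concrete, here is a small sketch (reusing data_scaled and pca from the code above, plus numpy) showing that a component and its negation capture exactly the same variance:
import numpy as np

# project the scaled data on PC-1 and on its sign-flipped version
scores = data_scaled.values.dot(pca.components_[0])
scores_flipped = data_scaled.values.dot(-pca.components_[0])

# the captured variance is identical, only the orientation differs
print(np.allclose(scores.var(), scores_flipped.var()))  # True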
Solution 2:
Edit: as others have commented, you may get the same values from the .components_ attribute.
Each principal component is a linear combination of the original variables:
PC_j = Beta_j1 * X_1 + Beta_j2 * X_2 + ... + Beta_jp * X_p
where the X_i are the original variables and the Beta_ji are the corresponding weights, the so-called coefficients.
To obtain the weights, you may simply pass an identity matrix to the transform method (this works here because the data were centered, so the mean that transform subtracts is essentially zero):
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> df_norm = (df - df.mean()) / df.std()   # one way to build the df_norm used below: standardize the iris df from Solution 1
>>> pca = PCA(n_components=2).fit(df_norm.values)
>>> i = np.identity(df.shape[1])  # identity matrix
>>> i
array([[ 1., 0., 0., 0.],
[ 0., 1., 0., 0.],
[ 0., 0., 1., 0.],
[ 0., 0., 0., 1.]])
>>> coef = pca.transform(i)
>>> coef
array([[ 0.5224, -0.3723],
[-0.2634, -0.9256],
[ 0.5813, -0.0211],
[ 0.5656, -0.0654]])
Each column of the coef matrix above shows the weights of the linear combination that produces the corresponding principal component:
>>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
PC-1 PC-2
sepal length (cm) 0.522 -0.372
sepal width (cm) -0.263 -0.926
petal length (cm) 0.581 -0.021
petal width (cm) 0.566 -0.065
[4 rows x 2 columns]
For example, the above shows that the second principal component (PC-2) is mostly aligned with sepal width, which has the highest weight, 0.926, in absolute value.
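If you want to read this off programmatically rather than by eye, one option (continuing the session above, with coef and df still in scope; coef_df is just an illustrative name) is to rank the absolute weights of a component:
>>> coef_df = pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
>>> coef_df['PC-2'].abs().sort_values(ascending=False).index[0]
'sepal width (cm)'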
You can also confirm that each coefficient vector has norm 1.0, since the principal axes returned by PCA are unit vectors:
>>> np.linalg.norm(coef, axis=0)
array([ 1., 1.])
One may also confirm that the principal components can be calculated as the dot product of the normalized data and the above coefficients:
>>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))
True
Note that we need to use numpy.allclose instead of the regular equality operator because of floating-point precision errors.
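Finally, as per the edit at the top of this solution, you can verify that these weights are exactly what the fitted estimator exposes in its components_ attribute (transposed), again up to floating point error:
>>> np.allclose(coef, pca.components_.T)
True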