Recovering features names of explained_variance_ratio_ in PCA with sklearn
Question
I'm trying to recover, from a PCA done with scikit-learn, which features are selected as relevant.
A classic example with the IRIS dataset.
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA
# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# normalize data
df_norm = (df - df.mean()) / df.std()
# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)
returns
In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452, 0.23030523])
How can I recover which two features account for these two explained variances across the dataset? Said differently, how can I get the indices of these features in iris.feature_names?
In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Thanks in advance for your help.
Recommended answer
This information is included in the pca attribute components_. As described in the documentation, pca.components_ outputs an array of shape [n_components, n_features], so to see how the components are linearly related to the different features you have to:
Note: each coefficient represents the correlation between a particular pair of component and feature.
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA
# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)
# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)
# Dump components relations with features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
PC-1 0.522372 -0.263355 0.581254 0.565611
PC-2 -0.372318 -0.925556 -0.021095 -0.065416
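To answer the question of recovering feature indices directly, one straightforward extension of the answer above (not part of the original answer itself) is to take, for each component, the feature with the largest absolute loading:

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Load and standardize the iris data, then fit the same 2-component PCA
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
pca = PCA(n_components=2)
pca.fit(scale(df))

# For each component, the index of the feature with the largest
# absolute loading, mapped back to the feature names
top_idx = np.abs(pca.components_).argmax(axis=1)
top_features = [iris.feature_names[i] for i in top_idx]
print(top_features)
# -> ['petal length (cm)', 'sepal width (cm)'], matching the loadings table above
```

This uses the absolute value of the loadings, so it is unaffected by the sign ambiguity discussed below.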
IMPORTANT: As a side comment, note that the sign of a PCA component does not affect its interpretation, since the sign does not affect the variance contained in each component. Only the relative signs of the features forming a PCA dimension are important. In fact, if you run the PCA code again, you might get the PCA dimensions with the signs inverted. For an intuition about this, think of a vector and its negative in 3-D space: both essentially represent the same direction. Check this post for further reference.
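The sign-invariance can be checked numerically. The sketch below (illustrative, with synthetic data rather than iris) projects centered data onto a direction v and onto -v, and shows the variance along the axis is identical either way:

```python
import numpy as np

# A direction v and its negative -v describe the same axis in space:
# projecting centered data onto either gives the same variance, which is
# why the sign of a principal component carries no information.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))
X = X - X.mean(axis=0)               # center the data

v = np.array([0.5, -0.5, 0.5, 0.5])  # an arbitrary unit-length direction
var_pos = (X @ v).var()
var_neg = (X @ -v).var()

print(np.isclose(var_pos, var_neg))  # -> True
```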