Recovering feature names of explained_variance_ratio_ in PCA with sklearn
Question
I'm trying to recover, from a PCA done with scikit-learn, which features were selected as relevant.

A classic example with the IRIS dataset:
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)
This returns:
In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452, 0.23030523])
How can I recover which two features account for this explained variance in the dataset? Said differently, how can I get the indices of these features in iris.feature_names?
In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Thanks in advance for your help.
Answer
This information is included in the pca attribute components_. As described in the documentation, pca.components_ outputs an array of shape [n_components, n_features], so to see how the components are linearly related to the different features you have to:
Note: each coefficient represents the correlation between a particular pair of component and feature.
import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump components' relations with features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
PC-1 0.522372 -0.263355 0.581254 0.565611
PC-2 -0.372318 -0.925556 -0.021095 -0.065416
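To directly answer the original question of which feature dominates each component, one way (a sketch, not the only convention) is to take, per row of pca.components_, the index of the loading with the largest absolute value:

```python
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

# load and scale the data as above
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

pca = PCA(n_components=2)
pca.fit(data_scaled)

# index of the feature with the largest absolute loading per component
top_idx = np.abs(pca.components_).argmax(axis=1)
for i, idx in enumerate(top_idx):
    print(f"PC-{i + 1}: feature {idx} -> {iris.feature_names[idx]}")
# PC-1: feature 2 -> petal length (cm)
# PC-2: feature 1 -> sepal width (cm)
```

Using the absolute value also sidesteps the sign-ambiguity issue discussed below, since a flipped component has the same magnitudes.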
IMPORTANT: As a side comment, note that the sign of a PCA component does not affect its interpretation, since the sign does not affect the variance contained in the component. Only the relative signs of the features forming a PCA dimension matter. In fact, if you run the PCA code again, you might get the PCA dimensions with the signs inverted. For an intuition about this, think about a vector and its negative in 3-D space: both essentially represent the same direction. Check this post for further reference.
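This sign invariance can be checked directly: flipping the sign of a component row negates the projected scores but leaves the variance they carry unchanged (a minimal sketch; it relies on the data already being centered by preprocessing.scale, so manual projection matches fit_transform):

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

iris = datasets.load_iris()
data = preprocessing.scale(iris.data)  # centered, so data @ components_.T == scores

pca = PCA(n_components=2)
scores = pca.fit_transform(data)

# flip the sign of the first component and re-project manually
flipped = pca.components_.copy()
flipped[0] *= -1
scores_flipped = data @ flipped.T

# PC-1 scores are negated, but their variance is identical
assert np.allclose(scores[:, 0], -scores_flipped[:, 0])
assert np.allclose(scores[:, 0].var(), scores_flipped[:, 0].var())
```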