Recovering feature names of explained_variance_ratio_ in PCA with sklearn


Problem description

I'm trying to recover, from a PCA done with scikit-learn, which features are selected as relevant.

A classic example with the IRIS dataset:

import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)

which returns

In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452,  0.23030523])

How can I recover which two features account for this explained variance in the dataset? Said differently, how can I get the indices of these features in iris.feature_names?

In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Thanks in advance for your help.

Answer

This information is included in the pca attribute components_. As described in the documentation, pca.components_ is an array of shape [n_components, n_features], so to see how the components are linearly related to the different features you have to:

Note: each coefficient represents the correlation between a particular component and feature.

import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump the components' relations with the features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))

      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
PC-1           0.522372         -0.263355           0.581254          0.565611
PC-2          -0.372318         -0.925556          -0.021095         -0.065416
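
Each entry of explained_variance_ratio_ belongs to a principal component, not to an original feature; the loadings table above is what ties the components back to the features. If you specifically want the index of the dominant feature of each component in iris.feature_names, here is a minimal sketch continuing from the fitted pca above (the variable name most_important is illustrative, not from the original answer):

import numpy as np

# index of the feature with the largest absolute loading in each component
most_important = np.abs(pca.components_).argmax(axis=1)
print(most_important)                                   # [2 1]
print([iris.feature_names[i] for i in most_important])  # ['petal length (cm)', 'sepal width (cm)']

Taking absolute values matters because, as noted below, the sign of a loading can flip between runs while its magnitude does not.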

IMPORTANT: As a side comment, note that the sign of a PCA component does not affect its interpretation, since the sign does not affect the variance contained in each component. Only the relative signs of the features forming a PCA dimension are important. In fact, if you run the PCA code again, you might get the PCA dimensions with the signs inverted. For an intuition about this, think of a vector and its negative in 3-D space: both essentially represent the same direction. Check this post for further reference.
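
To make the sign remark concrete, here is a small illustration (my own sketch, not from the original post): negating a principal axis flips the projected scores but leaves their variance, and hence the explained variance, unchanged:

import numpy as np

w = pca.components_[0]                  # first principal axis
scores_pos = data_scaled.values @ w     # projection onto w
scores_neg = data_scaled.values @ (-w)  # projection onto -w: same axis, opposite sign
print(np.allclose(scores_pos.var(), scores_neg.var()))  # True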

