Recovering feature names of explained_variance_ratio_ in PCA with sklearn

Problem Description

I'm trying to recover, from a PCA done with scikit-learn, which features were selected as relevant.

A classic example with the IRIS dataset.

import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
df_norm = (df - df.mean()) / df.std()

# PCA
pca = PCA(n_components=2)
pca.fit_transform(df_norm.values)
print(pca.explained_variance_ratio_)

This returns:

In [42]: pca.explained_variance_ratio_
Out[42]: array([ 0.72770452,  0.23030523])

How can I recover which two features account for these two explained variance ratios in the dataset? Said differently, how can I get the indices of these features in iris.feature_names?

In [47]: print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Thanks in advance for your help.

Answer

This information is included in the pca attribute components_. As described in the documentation, pca.components_ is an array of shape [n_components, n_features], so to see how the components are linearly related to the different features you have to:

Note: each coefficient represents the correlation between a particular component and a particular feature.

import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA

# load dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# normalize data
from sklearn import preprocessing
data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

# PCA
pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Dump components relations with features:
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))

      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
PC-1           0.522372         -0.263355           0.581254          0.565611
PC-2          -0.372318         -0.925556          -0.021095         -0.065416
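
To answer the original question directly, i.e. to map each component back to an index in iris.feature_names, one common heuristic is to take, per component, the feature with the largest absolute loading. This is a minimal sketch continuing from the code above (the argmax-over-absolute-loadings convention is my own addition, not part of the original answer):

import numpy as np

# For each principal component, find the index of the feature with the
# largest absolute loading, then look that index up in iris.feature_names.
most_important = np.abs(pca.components_).argmax(axis=1)
for i, idx in enumerate(most_important):
    print(f"PC-{i + 1}: {iris.feature_names[idx]} (loading {pca.components_[i, idx]:.3f})")

For the table above this reports petal length (cm) for PC-1 and sepal width (cm) for PC-2.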

IMPORTANT: As a side comment, note that the sign of a PCA component does not affect its interpretation, since the sign does not affect the variance contained in each component. Only the relative signs of the features forming a PCA dimension matter. In fact, if you run the PCA code again, you might get the PCA dimensions with the signs inverted. For an intuition about this, think of a vector and its negative in 3-D space: both essentially represent the same direction. Check this post for further reference.
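
As a quick check of this point (a hypothetical snippet of my own, continuing from the code above), you can verify that projecting the data onto a component and onto its negative gives exactly the same variance:

import numpy as np

# var(-x) == var(x): flipping a component's sign leaves the variance
# of the projected data unchanged.
proj = data_scaled.values @ pca.components_[0]
proj_flipped = data_scaled.values @ (-pca.components_[0])
print(np.var(proj), np.var(proj_flipped))  # prints two identical values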
