Feature/Variable importance after a PCA analysis


Problem Description


I have performed a PCA analysis on my original dataset and, from the compressed dataset produced by the PCA transform, I have selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling to identify the original features that are important in the reduced dataset. How do I find out which features are important and which are not among the remaining principal components after the dimensionality reduction? Here is my code:

from sklearn.decomposition import PCA

# keep 8 principal components of the (already scaled) dataset
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)


Furthermore, I also tried to run a clustering algorithm on the reduced dataset, but surprisingly the score is lower than on the original dataset. How is that possible?

Solution






First of all, I assume that by features you mean the variables and not the samples/observations. In that case, you could do something like the following, creating a biplot function that shows everything in one plot. In this example I am using the iris data.

Before the example, note that the basic idea when using PCA as a feature selection tool is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients (loadings). See the last paragraph after the plot for more details.

Overview:


PART1: I explain how to check the importance of the features and how to plot a biplot.


PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)    

pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # rescale the scores so the samples fit inside the unit square
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c=y)
    # one arrow per original feature: its loadings on PC1 and PC2
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

#Call the function. Use only the 2 PCs.
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
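Because myplot accepts an optional labels argument, the loading arrows can also be labelled with the actual iris feature names rather than the generic VarN labels. A minimal usage sketch, reusing the objects defined above:

# Same call as before, but label the arrows with the iris feature names
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]), labels=iris.feature_names)
plt.show()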


Visualize what's going on using the biplot

Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude, higher importance).


Let's first see how much variance each PC explains.

pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]


PC1 explains 72% and PC2 23%. Together, if we keep PC1 and PC2 only, they explain 95%.
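If you want the cumulative figure directly (for example, to check a threshold like the 94% mentioned in the question), numpy's cumsum gives it in one line. A small sketch using the pca object fitted above:

# cumulative explained variance of the first k PCs
print(np.cumsum(pca.explained_variance_ratio_))
# -> roughly [0.728, 0.958, 0.995, 1.0] for the iris data above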


Now, let's find the most important features.

print(abs( pca.components_ ))

[[0.52237162 0.26335492 0.58125401 0.56561105]
 [0.37231836 0.92555649 0.02109478 0.06541577]
 [0.72101681 0.24203288 0.14089226 0.6338014 ]
 [0.26199559 0.12413481 0.80115427 0.52354627]]


Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important.


To sum up, look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues. In sklearn the components are sorted by explained_variance_. The larger these absolute values are, the more a specific feature contributes to that principal component.

The important features are the ones that influence the components more, and thus have a large absolute value/score on the component.
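As a minimal sketch of that idea (reusing the iris pca fitted in Part 1), you can rank all features on PC1 by the absolute value of their loadings rather than looking only at the single largest one:

# original variable names and their |loadings| on the first PC
feature_names = np.array(iris.feature_names)
loadings_pc1 = np.abs(pca.components_[0])

# indices sorted from largest to smallest absolute loading
order = np.argsort(loadings_pc1)[::-1]
for name, value in zip(feature_names[order], loadings_pc1[order]):
    print(name, round(value, 3))
# petal length, petal width and sepal length come out on top (Var 3, 4 and 1 in the biplot)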


To get the most important features on the PCs, with names, and save them into a pandas DataFrame, use this:

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs= model.components_.shape[0]

# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(dic.items())

This prints:

     0  1
 0  PC0  e
 1  PC1  d


So on PC1 the feature named e is the most important, and on PC2 it is d.
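If you want more than the single top feature per PC, a small variation of the same idea (reusing model, n_pcs and initial_feature_names from the snippet above) keeps the full loading matrix in a DataFrame, so every feature's contribution to every PC can be ranked:

# full loading matrix: one row per PC, one column per original feature
loadings = pd.DataFrame(model.components_,
                        columns=initial_feature_names,
                        index=['PC{}'.format(i) for i in range(n_pcs)])

# rank all features for PC0 by absolute loading, largest first
print(loadings.abs().loc['PC0'].sort_values(ascending=False))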
