Feature/Variable importance after a PCA analysis


Question

I have performed a PCA analysis over my original dataset and from the compressed dataset transformed by the PCA I have also selected the number of PC I want to keep (they explain almost the 94% of the variance). Now I am struggling with the identification of the original features that are important in the reduced dataset. How do I find out which feature is important and which is not among the remaining Principal Components after the dimension reduction? Here is my code:

from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)
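
For reference, the "almost 94%" figure can be read straight off the fitted PCA object; a minimal sketch, assuming the same scaledDataset as in the snippet above:

# Sketch: check how much variance the 8 kept components explain
# (assumes the same `scaledDataset` as in the question's code)
from sklearn.decomposition import PCA

pca = PCA(n_components=8)
pca.fit(scaledDataset)
print(pca.explained_variance_ratio_.sum())   # total variance retained, should be close to 0.94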

Furthermore, I tried also to perform a clustering algorithm on the reduced dataset but surprisingly for me, the score is lower than on the original dataset. How is it possible?

Answer



First of all, I assume that by features you mean the variables rather than the samples/observations. In that case, you could do something like the following: create a biplot function that shows everything in one plot. In this example I am using the iris data.

Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings). See my last paragraph after the plot for more details.

Overview:

PART1: I explain how to check the importance of the features and how to plot a biplot.

PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)    

pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score, coeff, labels=None):
    # score: the PCA-projected samples (first two PCs); coeff: the loadings (n_features x 2)
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # Rescale the scores so that samples and loading arrows share the same range
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c=y)
    # Draw one arrow per original variable, labelled either "Var i" or with its name
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    # Set limits/labels after the scatter so autoscaling does not override them
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

#Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()


Visualize what is going on using the biplot.

Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude, higher importance).

Let's first see how much variance each PC explains.

pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]

PC1 explains 72% and PC2 23%. Together, if we keep PC1 and PC2 only, they explain 95%.
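
The 95% figure is just the cumulative sum of the first two ratios; a quick check, assuming the pca fitted above:

import numpy as np

print(np.cumsum(pca.explained_variance_ratio_))
# [0.72770452 0.95800975 0.99484807 1.        ]  -> PC1 + PC2 explain about 95%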

Now, let's find the most important features.

print(abs( pca.components_ ))

[[0.52237162 0.26335492 0.58125401 0.56561105]
 [0.37231836 0.92555649 0.02109478 0.06541577]
 [0.72101681 0.24203288 0.14089226 0.6338014 ]
 [0.26199559 0.12413481 0.80115427 0.52354627]]

Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the First Principal Component), which is the first row [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important. This is also clearly visible from the biplot (that's why we often use this plot to summarize the information in a visual way).
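To make the same conclusion easier to read, the PC1 loadings can also be ranked by absolute value together with the iris feature names; a minimal sketch, assuming the pca and iris objects created above (the loadings name is mine, not part of the original answer):

import pandas as pd

# Absolute loadings of PC1 (first row of pca.components_), labelled with the iris feature names
loadings = pd.Series(abs(pca.components_[0]), index=iris.feature_names)
print(loadings.sort_values(ascending=False))
# petal length, petal width and sepal length (i.e. features 3, 4 and 1) come out on top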

To sum up, look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues. In sklearn the components are sorted by explained_variance_. The larger these absolute values are, the more a specific feature contributes to that principal component.

The important features are the ones that influence the components more and thus have a large absolute value/score on the component.

To get the most important features on the PCs with names and save them into a pandas dataframe, use this:

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs= model.components_.shape[0]

# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(dic.items())

This prints:

     0  1
0  PC0  e
1  PC1  d

So on PC1 the feature named e is the most important, and on PC2 the one named d.
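
If more than a single top feature per PC is needed, the same idea extends with argsort instead of argmax; a small variation, assuming the model, n_pcs and initial_feature_names defined above (the top_k value is an arbitrary choice for illustration):

# Sketch: the top_k most important feature names on each component, by absolute loading
top_k = 2
top_features = {
    'PC{}'.format(i): [initial_feature_names[j]
                       for j in np.abs(model.components_[i]).argsort()[::-1][:top_k]]
    for i in range(n_pcs)
}
print(top_features)   # e.g. {'PC0': ['e', ...], 'PC1': ['d', ...]}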

There is also a nice article here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
