PCA on sklearn - how to interpret pca.components_


Question

I ran PCA on a data frame with 10 features using this simple code:

pca = PCA()
fit = pca.fit(dfPca)

The output of pca.explained_variance_ratio_ is:

array([  5.01173322e-01,   2.98421951e-01,   1.00968655e-01,
         4.28813755e-02,   2.46887288e-02,   1.40976609e-02,
         1.24905823e-02,   3.43255532e-03,   1.84516942e-03,
         4.50314168e-16])

I believe that means that the first PC explains about 50% of the variance, the second component explains about 30%, and so on...
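As a quick check (a minimal sketch, reusing the fitted pca object from the snippet above), you can print the ratios and their cumulative sum; with the values above, the first two components together account for roughly 80% of the variance:

import numpy as np

# Variance explained by each component, and the running total.
print(np.round(pca.explained_variance_ratio_, 3))
print(np.round(np.cumsum(pca.explained_variance_ratio_), 3))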

What I don't understand is the output of pca.components_. If I do the following:

df = pd.DataFrame(pca.components_, columns=list(dfPca.columns))

I get the data frame below, where each row is a principal component. What I'd like to understand is how to interpret that table. I know that if I square all the entries of a component and sum them I get 1, but what does the -0.56 on PC1 mean? Does it tell us something about "Feature E", since it has the highest magnitude in the component that explains about 50% of the variance?
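(For reference, the "squares sum to one" property can be checked directly on the fitted model; this is a small sketch reusing the pca object from the question:)

import numpy as np

# Each row of pca.components_ is a unit-length direction in feature space,
# so the squared entries of every row sum to 1.
print(np.sum(pca.components_ ** 2, axis=1))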

Thanks

Answer

Terminology: First of all, the results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
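In sklearn terms (a minimal sketch, assuming X is the standardized data that pca was fitted on), the scores are what transform returns, and the loadings, as defined above, are the rows of components_:

# Component scores: the data projected onto the principal components,
# one row per sample, one column per component.
scores = pca.transform(X)

# Loadings: components_ has one row per component and one column per
# original feature; its transpose is the per-feature view.
loadings = pca.components_.T
print(scores.shape, loadings.shape)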

PART1: I explain how to check the importance of the features and how to plot a biplot.

PART2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.

In your case, the value -0.56 for Feature E is the loading of this feature on PC1. This value tells us 'how much' the feature influences the PC (in our case, PC1).

So the larger this value is in absolute terms, the larger the influence on the principal component.
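For example (a small sketch, assuming dfPca and the fitted pca from the question), you can rank all the features by the magnitude of their loadings on PC1:

import pandas as pd

# Loadings of every original feature on the first principal component.
pc1_loadings = pd.Series(pca.components_[0], index=dfPca.columns)

# Sort by absolute value: the top entries influence PC1 the most.
print(pc1_loadings.reindex(pc1_loadings.abs().sort_values(ascending=False).index))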

After performing the PCA analysis, people usually plot the well-known 'biplot' to see the transformed features in the N dimensions (2 in our case) together with the original variables (features).

I wrote a function to plot this.

Example using the iris data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
y = iris.target

# In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

pca = PCA()
pca.fit(X)
x_new = pca.transform(X)

def myplot(score, coeff, labels=None):
    # score: component scores (projected data), one row per sample
    # coeff: loadings, one row per original feature
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]

    plt.scatter(xs, ys, c=y)  # color the points by the iris class
    for i in range(n):
        # draw an arrow for each original feature in the PC1/PC2 plane
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1),
                     color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i],
                     color='g', ha='center', va='center')
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function: scores on the first two PCs, and the loadings of
# every feature on those two PCs (hence the transpose).
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()


Result: the biplot of the iris data in the PC1/PC2 plane.

The important features are the ones that influence the components more, and thus have a large absolute loading on them.

To get the most important feature on each PC, together with its name, and save them into a pandas dataframe, use this:

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10, 5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs = model.components_.shape[0]

# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a', 'b', 'c', 'd', 'e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# LIST COMPREHENSION HERE AGAIN: map each PC (1-based) to its top feature
dic = {'PC{}'.format(i + 1): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(dic.items())

This prints:

     0  1
 0  PC1  e
 1  PC2  d

So on PC1 the feature named e is the most important, and on PC2 it is d.
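If you need more than the single top feature per component, a small variation on the snippet above takes the k largest absolute loadings instead (reusing model, n_pcs and initial_feature_names from the code above):

# Top-2 features per component, ranked by absolute loading.
k = 2
top_k = {'PC{}'.format(i + 1): [initial_feature_names[j]
                                for j in np.abs(model.components_[i]).argsort()[::-1][:k]]
         for i in range(n_pcs)}
print(top_k)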

