scikit-learn: Finding the features that contribute to each KMeans cluster


Question

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features has for each of the clusters?

What I want to be able to say is that for cluster k1, features 1, 4, and 6 were the primary features, whereas cluster k2's primary features were 2, 5, and 7.

This is the basic setup of what I am using:

from sklearn.cluster import KMeans

k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)   # data_features: array of shape (n_samples, n_features)
k_means_labels = k_means.labels_


Answer

You can use Principal Component Analysis (PCA):

PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).
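As a minimal illustration of the scores and loadings mentioned above (a sketch with made-up toy data, not part of the original answer):

    import numpy as np

    # Toy data: 5 observations, 3 features (made-up numbers, for illustration only)
    X = np.array([[1.0, 2.0, 0.5],
                  [2.0, 1.5, 1.0],
                  [3.0, 3.5, 1.5],
                  [4.0, 3.0, 2.0],
                  [5.0, 4.5, 2.5]])

    X_centered = X - X.mean(axis=0)             # mean centering per feature
    cov = np.cov(X_centered.T)                  # covariance matrix of the features
    e_values, e_vectors = np.linalg.eigh(cov)   # eigendecomposition (ascending order)

    order = np.argsort(e_values)[::-1]          # sort by explained variance, descending
    loadings = e_vectors[:, order]              # columns = loadings of each component
    scores = X_centered.dot(loadings)           # component scores of each observation

    print(e_values[order] / e_values.sum())     # variance portion per component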

Some essential points:


  • The eigenvalues reflect the portion of variance explained by the corresponding component. Say we have 4 features with eigenvalues 1, 4, 1, 2; these are the variances explained by the corresponding eigenvectors. The second value belongs to the first principal component, as it explains 50 % of the overall variance, and the last value belongs to the second principal component, explaining 25 % of the overall variance (a quick numeric check of this example follows after this list).
  • The eigenvectors are the components' linear combinations. They give the weights of the features, so you can tell which feature has a high or low impact.
  • Use PCA based on the correlation matrix instead of the empirical covariance matrix if the eigenvalues differ strongly in magnitude.
  • Do PCA on the entire dataset (that's what the function below does):
    • take the matrix with observations and features
    • center it to its average (the average of each feature's values over all observations)
    • compute the empirical covariance matrix (e.g. np.cov) or the correlation matrix (see above)
    • perform the eigendecomposition
    • sort the eigenvalues and eigenvectors by the eigenvalues to get the components with the highest impact
    • use the components on the original data
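
A quick numeric check of the eigenvalue example from the first bullet (a sketch; the numbers are the ones given above):

    import numpy as np

    e_values = np.array([1., 4., 1., 2.])     # eigenvalues from the example above
    explained = e_values / e_values.sum()     # fraction of total variance per component

    order = np.argsort(e_values)[::-1]        # components sorted by importance
    for rank, i in enumerate(order, start=1):
        print(f"PC{rank}: eigenvalue = {e_values[i]:.0f}, explains {explained[i]:.1%}")
    # PC1: eigenvalue = 4, explains 50.0%
    # PC2: eigenvalue = 2, explains 25.0%
    # PC3: eigenvalue = 1, explains 12.5%
    # PC4: eigenvalue = 1, explains 12.5%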

You need to import numpy as np and scipy as sp. It uses sp.linalg.eigh for decomposition. You might also want to check scikit-learn's decomposition module.
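
For comparison, roughly the same quantities are also exposed by scikit-learn's own PCA class. This is only a sketch of that route (explained_variance_ratio_ and components_ are standard sklearn attributes, but this snippet is not part of the original answer):

    import numpy as np
    from sklearn.decomposition import PCA

    # X: observations in rows, features in columns (random data just so the sketch runs)
    X = np.random.rand(20, 4)

    pca = PCA(n_components=2)
    scores = pca.fit_transform(X)            # component scores, shape (20, 2)

    print(pca.explained_variance_ratio_)     # portion of variance per component
    print(pca.components_[0])                # loadings of the 1st PC, one weight per feature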

PCA is performed on a data matrix with observations (objects) in rows and features in columns.

      def dim_red_pca(X, d=0, corr=False):
          r"""
          Performs principal component analysis.
      
          Parameters
          ----------
          X : array, (n, d)
              Original observations (n observations, d features)
      
          d : int
              Number of principal components (default is ``0`` => all components).
      
          corr : bool
              If true, the PCA is performed based on the correlation matrix.
      
          Notes
          -----
          Always all eigenvalues and eigenvectors are returned,
          independently of the desired number of components ``d``.
      
          Returns
          -------
          Xred : array, (n, m or d)
              Reduced data matrix
      
          e_values : array, (m)
              The eigenvalues, sorted in descending manner.
      
          e_vectors : array, (n, m)
              The eigenvectors, sorted corresponding to eigenvalues.
      
          """
          # Center to average
          X_ = X-X.mean(0)
          # Compute correlation / covariance matrix
          if corr:
              CO = np.corrcoef(X_.T)
          else:
              CO = np.cov(X_.T)
          # Compute eigenvalues and eigenvectors
          e_values, e_vectors = sp.linalg.eigh(CO)
      
          # Sort the eigenvalues and the eigenvectors descending
          idx = np.argsort(e_values)[::-1]
          e_vectors = e_vectors[:, idx]
          e_values = e_values[idx]
          # Get the number of desired dimensions
          d_e_vecs = e_vectors
          if d > 0:
              d_e_vecs = e_vectors[:, :d]
          else:
              d = None
          # Map principal components to original data
          LIN = np.dot(d_e_vecs, np.dot(d_e_vecs.T, X_.T)).T
          return LIN[:, :d], e_values, e_vectors
      


Sample usage

Here's a sample script which makes use of the given function and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run; this is due to the starting clusters being initialized randomly.

      import numpy as np
      import scipy as sp
      from scipy.cluster.vq import kmeans2
      import matplotlib.pyplot as plt
      
      SN = np.array([ [1.325, 1.000, 1.825, 1.750],
                      [2.000, 1.250, 2.675, 1.750],
                      [3.000, 3.250, 3.000, 2.750],
                      [1.075, 2.000, 1.675, 1.000],
                      [3.425, 2.000, 3.250, 2.750],
                      [1.900, 2.000, 2.400, 2.750],
                      [3.325, 2.500, 3.000, 2.000],
                      [3.000, 2.750, 3.075, 2.250],
                      [2.075, 1.250, 2.000, 2.250],
                      [2.500, 3.250, 3.075, 2.250],
                      [1.675, 2.500, 2.675, 1.250],
                      [2.075, 1.750, 1.900, 1.500],
                      [1.750, 2.000, 1.150, 1.250],
                      [2.500, 2.250, 2.425, 2.500],
                      [1.675, 2.750, 2.000, 1.250],
                      [3.675, 3.000, 3.325, 2.500],
                      [1.250, 1.500, 1.150, 1.000]], dtype=float)
          
      clust,labels_ = kmeans2(SN,3)    # cluster with 3 random initial clusters
      # PCA on orig. dataset 
      # Xred will have only 2 columns, the first two princ. comps.
      # evals has shape (4,) and evecs (4,4). We need all eigenvalues 
      # to determine the portion of variance
      Xred, evals, evecs = dim_red_pca(SN,2)   
      
      xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
      ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
      # plot the clusters, each set separately
      plt.figure()    
      ax = plt.gca()
      scatterHs = []
      clr = ['r', 'b', 'k']
      for cluster in set(labels_):
          scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1], 
                         color=clr[cluster], label='Cluster {}'.format(cluster)))
      plt.legend(handles=scatterHs,loc=4)
      plt.setp(ax, title='First and Second Principal Components', xlabel=xlab, ylabel=ylab)
      # plot also the eigenvectors for deriving the influence of each feature
      fig, ax = plt.subplots(2,1)
      ax[0].bar([1, 2, 3, 4],evecs[0])
      plt.setp(ax[0], title="First and Second Component's Eigenvectors ", ylabel='Weight')
      ax[1].bar([1, 2, 3, 4],evecs[1])
      plt.setp(ax[1], xlabel='Features', ylabel='Weight')
      


Output

The eigenvectors show the weighting of each feature for the component:

[image: https://i.stack.imgur.com/9yndq.png]

Let's just have a look at cluster zero, the red one. We'll be mostly interested in the first component, as it explains about 3/4 of the distribution. The red cluster is in the upper area of the first component. All observations yield rather high values. What does that mean? Looking at the linear combination of the first component, we see at first sight that the second feature is rather unimportant (for this component). The first and fourth features are the highest weighted, and the third one has a negative score. This means that - since all red vertices have a rather high score on the first PC - these vertices will have high values in the first and last features, while at the same time having low scores for the third feature.
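
As a hypothetical cross-check of that reading (not part of the original answer), one can also compare each cluster's feature means directly against the overall feature means, reusing SN and labels_ from the sample script:

    import numpy as np

    overall_mean = SN.mean(axis=0)
    for cluster in sorted(set(labels_)):
        deviation = SN[labels_ == cluster].mean(axis=0) - overall_mean
        # positive entries: features that are above average in this cluster
        print(cluster, np.round(deviation, 2))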

Concerning the second feature, we can have a look at the second PC. However, note that the overall impact is far smaller, as this component explains only roughly 16 % of the variance compared to the ~74 % of the first PC.
