Tutorial for scipy.cluster.hierarchy


Question

I'm trying to understand how to manipulate a hierarchical clustering, but the documentation is too ... technical ... and I can't understand how it works.

Is there any tutorial that can help me get started, explaining some simple tasks step by step?

Let's say I have the following data set:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

a = np.array([[0,   0  ],
              [1,   0  ],
              [0,   1  ],
              [1,   1  ],
              [0.5, 0  ],
              [0,   0.5],
              [0.5, 0.5],
              [2,   2  ],
              [2,   3  ],
              [3,   2  ],
              [3,   3  ]])

I can easily do the hierarchical clustering and plot the dendrogram:

z = linkage(a)
d = dendrogram(z)

  • Now, how can I recover a specific cluster? Let's say the one with elements [0, 1, 2, 4, 5, 6] in the dendrogram?
  • How can I get back the values of those elements?
Answer

There are three steps in hierarchical agglomerative clustering (HAC):

1. Quantify Data (metric argument)
2. Cluster Data (method argument)
3. Choose the number of clusters
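
As a rough sketch, each step maps onto one SciPy call: pdist quantifies the data into pairwise distances, linkage builds the merge hierarchy, and fcluster extracts a flat clustering of the size you choose (the random data here is just a placeholder):

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

a = np.random.rand(10, 2)                      # any (n_samples, n_features) array

d = pdist(a, metric='euclidean')               # step 1: quantify the data
z = linkage(d, method='single')                # step 2: cluster the data
labels = fcluster(z, 2, criterion='maxclust')  # step 3: choose the number of clusters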

z = linkage(a)

will accomplish the first two steps. Since you did not specify any parameters, it uses the default values:

1. metric = 'euclidean'
2. method = 'single'
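
Written out explicitly, z = linkage(a) is therefore equivalent to the call below. The resulting (n-1) x 4 linkage matrix z encodes the whole hierarchy: each row records the two cluster indices that were merged, the distance between them, and the size of the new cluster:

z = linkage(a, method='single', metric='euclidean')
print(z)  # rows: [cluster_i, cluster_j, merge distance, size of new cluster]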

So z = linkage(a) will give you a single-linkage hierarchical agglomerative clustering of a. This clustering is a kind of hierarchy of solutions, and from this hierarchy you get some information about the structure of your data. What you might do now is:

  • Check which metric is appropriate, e.g. cityblock or chebyshev will quantify your data differently (cityblock, euclidean and chebyshev correspond to the L1, L2 and L_inf norms)
  • Check the different properties / behaviours of the methods (e.g. single, complete and average)
  • Check how to determine the number of clusters, e.g. by reading the wiki about it
  • Compute indices on the found solutions (clusterings), such as the silhouette coefficient (this coefficient gives feedback on how well a point/observation fits the cluster it is assigned to). Different indices use different criteria to qualify a clustering; see the sketch below.
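
For instance, the silhouette coefficient from the last bullet can be computed with scikit-learn (an extra dependency, used here only for illustration) on the flat clustering that fcluster returns; a minimal sketch:

import scipy.cluster.hierarchy as hac
from sklearn.metrics import silhouette_score

z = hac.linkage(a, method='single')
labels = hac.fcluster(z, 2, criterion='maxclust')
print(silhouette_score(a, labels))  # in [-1, 1]; higher = better separated clusters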

To get you started, here is a complete example:

import numpy as np
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt


a = np.array([[0.1,   2.5],
              [1.5,   .4 ],
              [0.3,   1  ],
              [1  ,   .8 ],
              [0.5,   0  ],
              [0  ,   0.5],
              [0.5,   0.5],
              [2.7,   2  ],
              [2.2,   3.1],
              [3  ,   2  ],
              [3.2,   1.3]])

fig, axes23 = plt.subplots(2, 3)

for method, axes in zip(['single', 'complete'], axes23):
    z = hac.linkage(a, method=method)

    # Plot the merge distances of the hierarchy, largest first (scree plot)
    axes[0].plot(range(1, len(z)+1), z[::-1, 2])
    # Second derivative of the merge distances; a large value marks a "knee"
    knee = np.diff(z[::-1, 2], 2)
    axes[0].plot(range(2, len(z)), knee)

    # Two candidate numbers of clusters: the two strongest knees
    num_clust1 = knee.argmax() + 2
    knee[knee.argmax()] = 0
    num_clust2 = knee.argmax() + 2

    axes[0].text(num_clust1, z[::-1, 2][num_clust1-1], 'possible\n<- knee point')

    # Cut the hierarchy into flat clusterings with the candidate sizes
    part1 = hac.fcluster(z, num_clust1, 'maxclust')
    part2 = hac.fcluster(z, num_clust2, 'maxclust')

    clr = ['#2200CC' ,'#D9007E' ,'#FF6600' ,'#FFCC00' ,'#ACE600' ,'#0099CC' ,
           '#8900CC' ,'#FF0000' ,'#FF9900' ,'#FFFF00' ,'#00CC01' ,'#0055CC']

    # Scatter plot of each flat clustering, one colour per cluster
    for part, ax in zip([part1, part2], axes[1:]):
        for cluster in set(part):
            ax.scatter(a[part == cluster, 0], a[part == cluster, 1],
                       color=clr[cluster])

    m = '\n(method: {})'.format(method)
    plt.setp(axes[0], title='Screeplot{}'.format(m), xlabel='partition',
             ylabel='{}\ncluster distance'.format(m))
    plt.setp(axes[1], title='{} Clusters'.format(num_clust1))
    plt.setp(axes[2], title='{} Clusters'.format(num_clust2))

plt.tight_layout()
plt.show()


This gives:

[Figure: for each method ('single' and 'complete'), a scree plot with knee points and scatter plots of the two candidate clusterings]
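
Coming back to the original question: to recover a specific cluster such as [0, 1, 2, 4, 5, 6] and get the values of its elements, cut the hierarchy with fcluster and index the data with the resulting labels. With the question's data and the default single linkage, those six points all merge at distance 0.5 while point 3 only joins at about 0.71, so cutting the tree at 0.6 isolates exactly that cluster (the threshold is specific to this data):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

a = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0], [0, 0.5],
              [0.5, 0.5], [2, 2], [2, 3], [3, 2], [3, 3]])

z = linkage(a)                                   # single linkage, euclidean
labels = fcluster(z, 0.6, criterion='distance')  # cut the tree at distance 0.6
cluster_id = labels[0]                           # id of the cluster containing point 0
members = np.where(labels == cluster_id)[0]      # -> array([0, 1, 2, 4, 5, 6])
values = a[labels == cluster_id]                 # the coordinates of those points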

