How to plot and annotate hierarchical clustering dendrograms in scipy/matplotlib

Question

I'm using dendrogram from scipy to plot a hierarchical clustering with matplotlib, as follows:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Left panel: cluster using mat directly.
plt.subplot(1, 2, 1)
plt.title("mat")
dist_mat = mat
linkage_matrix = linkage(dist_mat, "single")
print("linkage2:")
print(linkage(1 - dist_mat, "single"))
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)

# Right panel: cluster using 1 - mat.
plt.subplot(1, 2, 2)
plt.title("1 - mat")
dist_mat = 1 - mat
linkage_matrix = linkage(dist_mat, "single")
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)
plt.show()

My questions are: first, why do mat and 1-mat give identical clusterings here? Second, how can I annotate the distance along each branch of the tree using dendrogram, so that the distances between pairs of nodes can be compared?

Finally, it seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown? Thanks.

Recommended answer

The input to linkage() is either an n x m array, representing n points in m-dimensional space, or a one-dimensional array containing the condensed distance matrix. In your example, mat is 3 x 3, so you are clustering three 3-d points. The clustering is based on the distances between these points.
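As a minimal sketch of the two input forms (the use of pdist and squareform from scipy.spatial.distance is an addition for illustration, not part of the original answer): passing the 3 x 3 array directly should be equivalent to passing the condensed Euclidean distances between its rows. If the entries of 1 - mat were really meant as pairwise distances, they would first have to be converted to condensed form.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# A 2-d input is treated as observations: each row is a point, and
# linkage() computes the Euclidean distances between the rows itself.
z_points = linkage(mat, "single")
z_condensed = linkage(pdist(mat), "single")

# To treat 1 - mat as a matrix of pairwise distances instead, convert the
# square matrix to the condensed (1-d) form before calling linkage().
condensed = squareform(1 - mat)
z_distances = linkage(condensed, "single")

print(z_points)
print(z_distances)
```

Note that squareform only accepts a symmetric matrix with a zero diagonal, which 1 - mat happens to satisfy here.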

Why do mat and 1-mat give identical clusterings here?

The arrays mat and 1-mat produce the same clustering because the clustering is based on the distances between the points, and neither a reflection (-mat) nor a translation (mat + offset) of the entire data set changes the relative distances between the points.
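A quick way to check this claim, sketched here with pdist from scipy.spatial.distance (an addition for illustration), is to compare the pairwise Euclidean distances between the rows of mat and of 1 - mat:

```python
import numpy as np
from scipy.spatial.distance import pdist

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Pairwise Euclidean distances between the rows (the three 3-d points).
d_mat = pdist(mat)
d_flipped = pdist(1 - mat)

# (1 - a) - (1 - b) = b - a, so every difference vector only changes sign
# and its length is unchanged.
same = np.allclose(d_mat, d_flipped)
print(same)
```

Since linkage() sees the same distances in both cases, it builds the same tree.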

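How can I annotate the distance along each branch of the tree using dendrogram, so that the distances between pairs of nodes can be compared?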

In the code below, I show how you can use the data returned by dendrogram to label the horizontal segments of the diagram with the corresponding distance. The values associated with the keys icoord and dcoord give the x and y coordinates of each three-segment inverted U in the figure. In augmented_dendrogram, this data is used to add a label with the distance (i.e. the y value) of each horizontal line segment in the dendrogram.

from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt


def augmented_dendrogram(*args, **kwargs):
    """Draw a dendrogram and label each merge with its distance."""
    ddata = dendrogram(*args, **kwargs)

    if not kwargs.get('no_plot', False):
        for i, d in zip(ddata['icoord'], ddata['dcoord']):
            # Midpoint of the horizontal bar of this inverted U.
            x = 0.5 * sum(i[1:3])
            # Height of the bar, i.e. the merge distance.
            y = d[1]
            plt.plot(x, y, 'ro')
            plt.annotate("%.3g" % y, (x, y), xytext=(0, -8),
                         textcoords='offset points',
                         va='top', ha='center')

    return ddata
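To see what icoord and dcoord contain without drawing anything, the following sketch calls dendrogram with no_plot=True on the mat array from the question and prints one line per inverted U. The variable names are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

Z = linkage(mat, "single")
# no_plot=True returns the layout data without touching matplotlib.
ddata = dendrogram(Z, labels=["a", "b", "c"], no_plot=True)

for xs, ys in zip(ddata["icoord"], ddata["dcoord"]):
    # xs and ys each hold four values: the corners of one inverted U.
    # ys[1] == ys[2] is the height of the horizontal bar (the merge
    # distance), and the bar runs from xs[1] to xs[2].
    x_mid = 0.5 * (xs[1] + xs[2])
    print("bar at height %.3g, centered at x = %g" % (ys[1], x_mid))
```

There is one inverted U per merge, so n points give n - 1 entries in each list.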

For your mat array, the augmented dendrogram shows that points 'a' and 'c' are 1.01 units apart, and that point 'b' is 1.57 units from the cluster ['a', 'c'].
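These two numbers can be reproduced by hand. The sketch below (an illustrative addition) computes the Euclidean distances between the rows of mat and applies the single-linkage rule, under which the distance from a point to a cluster is the smallest distance to any member of that cluster.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Square matrix of Euclidean distances between rows a, b, c (indices 0, 1, 2).
D = squareform(pdist(mat))
d_ab, d_ac, d_bc = D[0, 1], D[0, 2], D[1, 2]

# 'a' and 'c' are the closest pair, so they merge first at height d_ac.
# With single linkage, 'b' then joins at its smaller distance to 'a' or 'c'.
d_b_to_ac = min(d_ab, d_bc)

print("a-c: %.3g, b-{a,c}: %.3g" % (d_ac, d_b_to_ac))
```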

It seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown?

The flag show_leaf_counts only applies when not all of the original data points are shown as leaves. For example, when truncate_mode="lastp", only the last p nodes are shown.

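Here's an example with 100 points. It imports augmented_dendrogram from a module of the same name, so the function above is assumed to be saved as augmented_dendrogram.py: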

import numpy as np
from scipy.cluster.hierarchy import linkage
import matplotlib.pyplot as plt
from augmented_dendrogram import augmented_dendrogram


# Generate a random sample of `n` points in 2-d.
np.random.seed(12312)
n = 100
x = np.random.multivariate_normal([0, 0], np.array([[4.0, 2.5], [2.5, 1.4]]),
                                  size=(n,))

plt.figure(1, figsize=(6, 5))
plt.clf()
plt.scatter(x[:, 0], x[:, 1])
plt.axis('equal')
plt.grid(True)

linkage_matrix = linkage(x, "single")

plt.figure(2, figsize=(10, 4))
plt.clf()

plt.subplot(1, 2, 1)
show_leaf_counts = False
ddata = augmented_dendrogram(linkage_matrix,
               color_threshold=1,
               p=6,
               truncate_mode='lastp',
               show_leaf_counts=show_leaf_counts,
               )
plt.title("show_leaf_counts = %s" % show_leaf_counts)

plt.subplot(1, 2, 2)
show_leaf_counts = True
ddata = augmented_dendrogram(linkage_matrix,
               color_threshold=1,
               p=6,
               truncate_mode='lastp',
               show_leaf_counts=show_leaf_counts,
               )
plt.title("show_leaf_counts = %s" % show_leaf_counts)

plt.show()

These are the points in the data set:

With p=6 and truncate_mode="lastp", dendrogram only shows the "top" of the dendrogram. The following shows the effect of show_leaf_counts.
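The effect can also be inspected without a plot. The following sketch assumes a different random sample from the one above (100 standard-normal points from a seeded NumPy generator) and reads the leaf labels from the ivl entry of the dictionary returned by dendrogram. In the SciPy versions I am aware of, a leaf that stands for a contracted cluster is labeled with its size in parentheses when show_leaf_counts is True and with an empty string otherwise.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical sample for illustration; not the data set used above.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 2))
Z = linkage(x, "single")

leaf_labels = {}
for show in (False, True):
    ddata = dendrogram(Z, p=6, truncate_mode="lastp",
                       show_leaf_counts=show, no_plot=True)
    # 'ivl' holds the text drawn under each leaf, from left to right.
    leaf_labels[show] = ddata["ivl"]
    print("show_leaf_counts = %s: %s" % (show, leaf_labels[show]))
```

Leaves that are single original points keep their index as the label in both cases, which is why the flag appears to do nothing when the tree is not truncated.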
