How to plot and annotate hierarchical clustering dendrograms in scipy/matplotlib


Question

I'm using dendrogram from scipy to plot hierarchical clustering with matplotlib as follows:

from numpy import array
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

mat = array([[1, 0.5, 0.9],
             [0.5, 1, -0.5],
             [0.9, -0.5, 1]])
plt.subplot(1,2,1)
plt.title("mat")
dist_mat = mat
linkage_matrix = linkage(dist_mat,
                         "single")
print("linkage2:")
print(linkage(1-dist_mat, "single"))
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)
plt.subplot(1,2,2)
plt.title("1 - mat")
dist_mat = 1 - mat
linkage_matrix = linkage(dist_mat,
                         "single")
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)
plt.show()

My questions are: first, why do mat and 1-mat give identical clusterings here? Second, how can I annotate the distance along each branch of the tree using dendrogram, so that the distances between pairs of nodes can be compared?

Finally, it seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown? Thanks.

Recommended answer

The input to linkage() is either an n x m array, representing n points in m-dimensional space, or a one-dimensional array containing the condensed distance matrix. In your example, mat is 3 x 3, so you are clustering three 3-d points. Clustering is based on the distance between these points.
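
If the intent was to cluster on the matrix entries themselves rather than on three 3-d points, the condensed form is what linkage() needs. The following is a minimal sketch, not part of the original answer. It assumes that mat holds similarities and that 1 - mat is therefore a valid distance matrix (symmetric, with zeros on the diagonal); squareform from scipy.spatial.distance does the conversion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Assumption: mat holds similarities, so 1 - mat is a distance matrix
# (symmetric, zeros on the diagonal).
square_dist = 1 - mat

# squareform converts the square matrix to the condensed 1-d form:
# the upper-triangle entries d(a,b), d(a,c), d(b,c).
condensed = squareform(square_dist)

# A 1-d input is interpreted as condensed distances, not as points.
Z = linkage(condensed, "single")

print(condensed)
print(Z)
```

With this input the merge heights in the third column of Z are entries of 1 - mat, not Euclidean distances between rows.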

Why do mat and 1-mat give identical clusterings here?

The arrays mat and 1-mat produce the same clustering because the clustering is based on distances between the points, and neither a reflection (-mat) nor a translation (mat + offset) of the entire data set changes the relative distances between the points.
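
This can be checked numerically. The sketch below is an illustration rather than part of the original answer. It uses pdist with its default Euclidean metric, which is also the default metric linkage() applies to an n x m array, and compares the pairwise distances and the linkage matrices for mat and 1 - mat.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Pairwise Euclidean distances between the rows, in condensed form.
d_mat = pdist(mat)
d_flipped = pdist(1 - mat)
print(d_mat)
print(d_flipped)

# The reflection and shift leave every pairwise distance unchanged,
# so the two linkage matrices agree as well.
same_distances = np.allclose(d_mat, d_flipped)
same_linkage = np.allclose(linkage(mat, "single"),
                           linkage(1 - mat, "single"))
print(same_distances, same_linkage)
```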

How can I annotate the distance along each branch of the tree using dendrogram, so that the distances between pairs of nodes can be compared?

In the code below, I show how you can use the data returned by dendrogram to label the horizontal segments of the diagram with the corresponding distance. The values associated with the keys icoord and dcoord give the x and y coordinates of each three-segment inverted-U of the figure. In augmented_dendrogram this data is used to add a label showing the distance (i.e. the y value) of each horizontal line segment in the dendrogram.

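# Save this file as augmented_dendrogram.py; the 100-point example imports it.
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt


def augmented_dendrogram(*args, **kwargs):
    """Draw a dendrogram and label each link with its distance."""

    ddata = dendrogram(*args, **kwargs)

    if not kwargs.get('no_plot', False):
        # Each (i, d) pair holds the four x and four y coordinates
        # of one inverted-U link.
        for i, d in zip(ddata['icoord'], ddata['dcoord']):
            x = 0.5 * sum(i[1:3])  # midpoint of the horizontal segment
            y = d[1]               # height of the link (merge distance)
            plt.plot(x, y, 'ro')
            plt.annotate("%.3g" % y, (x, y), xytext=(0, -8),
                         textcoords='offset points',
                         va='top', ha='center')

    return ddata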

For your mat array, the augmented dendrogram shows that points 'a' and 'c' are 1.01 units apart, and point 'b' is 1.57 units from the cluster ['a', 'c'].
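
The same distances can be read from the dictionary that dendrogram returns without drawing anything. The following sketch is an illustration, not part of the original answer. It passes no_plot=True and prints the value that augmented_dendrogram would use as the label for each link.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

Z = linkage(mat, "single")

# no_plot=True skips drawing; the coordinate data is still returned.
ddata = dendrogram(Z, labels=["a", "b", "c"], no_plot=True)

heights = []
for i, d in zip(ddata["icoord"], ddata["dcoord"]):
    x = 0.5 * sum(i[1:3])  # midpoint of the horizontal segment
    y = d[1]               # height of the link (merge distance)
    heights.append(y)
    print("x = %g, distance = %.3g" % (x, y))
```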

It seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown?

The flag show_leaf_counts only applies when not all the original data points are shown as leaves. For example, when truncate_mode = "lastp", only the last p nodes are shown.

Here's an example with 100 points:

import numpy as np
from scipy.cluster.hierarchy import linkage
import matplotlib.pyplot as plt
from augmented_dendrogram import augmented_dendrogram


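# Generate a random sample of `n` points in 2-d.
# Note: this covariance matrix is not positive semidefinite (its
# determinant is negative), so NumPy warns about it by default.
np.random.seed(12312)
n = 100
x = np.random.multivariate_normal([0, 0], np.array([[4.0, 2.5], [2.5, 1.4]]),
                                  size=(n,))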

plt.figure(1, figsize=(6, 5))
plt.clf()
plt.scatter(x[:, 0], x[:, 1])
plt.axis('equal')
plt.grid(True)

linkage_matrix = linkage(x, "single")

plt.figure(2, figsize=(10, 4))
plt.clf()

plt.subplot(1, 2, 1)
show_leaf_counts = False
ddata = augmented_dendrogram(linkage_matrix,
               color_threshold=1,
               p=6,
               truncate_mode='lastp',
               show_leaf_counts=show_leaf_counts,
               )
plt.title("show_leaf_counts = %s" % show_leaf_counts)

plt.subplot(1, 2, 2)
show_leaf_counts = True
ddata = augmented_dendrogram(linkage_matrix,
               color_threshold=1,
               p=6,
               truncate_mode='lastp',
               show_leaf_counts=show_leaf_counts,
               )
plt.title("show_leaf_counts = %s" % show_leaf_counts)

plt.show()

Figure 1 is a scatter plot of the points in the data set.

With p=6 and truncate_mode="lastp", dendrogram only shows the "top" of the dendrogram. Figure 2 shows the effect of show_leaf_counts: the left panel uses show_leaf_counts = False and the right panel uses show_leaf_counts = True.
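
The effect of the flag can also be seen without a figure, by looking at the leaf labels that dendrogram returns under the key ivl. The sketch below is an illustration under stated assumptions, not part of the original answer. It uses its own random sample (a standard normal draw, not the data set above) and two hypothetical helpers, leaf_labels and leaf_size.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Assumption: any sample of 100 points in 2-d will do for this check.
rng = np.random.default_rng(0)
n = 100
points = rng.normal(size=(n, 2))
Z = linkage(points, "single")


def leaf_labels(show_leaf_counts):
    """Return the leaf labels of the truncated dendrogram (no drawing)."""
    ddata = dendrogram(Z, p=6, truncate_mode="lastp",
                       show_leaf_counts=show_leaf_counts, no_plot=True)
    return ddata["ivl"]


with_counts = leaf_labels(True)
without_counts = leaf_labels(False)
print(with_counts)
print(without_counts)


def leaf_size(label):
    """A label like "(17)" is a contracted cluster; anything else is one point."""
    return int(label[1:-1]) if label.startswith("(") else 1


# The leaf counts account for every original observation.
total = sum(leaf_size(label) for label in with_counts)
print(total)
```

With show_leaf_counts=True each contracted cluster is labelled with its size in parentheses; with show_leaf_counts=False the same leaves get an empty label.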
