How to plot and annotate hierarchical clustering dendrograms in scipy/matplotlib
Question
I'm using dendrogram from scipy to plot hierarchical clustering using matplotlib as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Left panel: cluster using mat directly.
plt.subplot(1, 2, 1)
plt.title("mat")
dist_mat = mat
linkage_matrix = linkage(dist_mat, "single")
print("linkage2:")
print(linkage(1 - dist_mat, "single"))
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)

# Right panel: cluster using 1 - mat.
plt.subplot(1, 2, 2)
plt.title("1 - mat")
dist_mat = 1 - mat
linkage_matrix = linkage(dist_mat, "single")
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)
My questions are: first, why do mat and 1-mat give identical clusterings here? Second, how can I annotate the distance along each branch of the tree using dendrogram so that the distances between pairs of nodes can be compared?
Finally, it seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown? Thanks.
Recommended answer
The input to linkage() is either an n x m array, representing n points in m-dimensional space, or a one-dimensional array containing the condensed distance matrix. In your example, mat is 3 x 3, so you are clustering three 3-d points. Clustering is based on the distances between these points.
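If the intent was instead to treat 1 - mat as a matrix of pairwise distances, the condensed form mentioned above is what linkage() would need. The following is a minimal sketch under that assumption (the question does not say which interpretation was intended); it uses scipy.spatial.distance.squareform to convert the square matrix to condensed form.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Assumption: 1 - mat is meant to be a square distance matrix
# (symmetric, zero diagonal), not a set of three 3-d points.
dist_mat = 1 - mat

# squareform() converts the square matrix to the condensed 1-d form
# (upper triangle, row by row), which linkage() treats as distances.
condensed = squareform(dist_mat)

Z = linkage(condensed, "single")
print(condensed)  # distances for the pairs (a, b), (a, c), (b, c)
print(Z)          # each row: the two merged clusters, merge distance, cluster size
```

With this input the merge heights come from the entries of 1 - mat themselves, rather than from Euclidean distances between the rows.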
Why do mat and 1-mat give identical clusterings here?
The arrays mat and 1-mat produce the same clustering because the clustering is based on the distances between the points, and neither a reflection (-mat) nor a translation (mat + offset) of the entire data set changes the relative distances between the points.
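This can be checked numerically. The sketch below compares the pairwise Euclidean distances between the rows of mat and of 1 - mat using scipy.spatial.distance.pdist, and then compares the two linkage matrices; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Pairwise Euclidean distances between the rows (the three 3-d points).
d_mat = pdist(mat)
d_one_minus = pdist(1 - mat)

# 1 - mat reflects and translates every point, so the distances agree.
same_distances = np.allclose(d_mat, d_one_minus)

# Identical distances lead to identical single-linkage results.
# (SciPy may warn that 1 - mat looks like an uncondensed distance matrix.)
same_linkage = np.allclose(linkage(mat, "single"),
                           linkage(1 - mat, "single"))

print(d_mat)
print(same_distances, same_linkage)
```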
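How can I annotate the distance along each branch of the tree using dendrogram so that the distances between pairs of nodes can be compared?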
In the code below, I show how you can use the data returned by dendrogram to label the horizontal segments of the diagram with the corresponding distance. The values associated with the keys icoord and dcoord give the x and y coordinates of each three-segment inverted U in the figure. In augmented_dendrogram, this data is used to label each horizontal line segment in the dendrogram with its distance (i.e. its y value).
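# augmented_dendrogram.py
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt


def augmented_dendrogram(*args, **kwargs):
    """Call dendrogram() and label each merge with its distance."""
    ddata = dendrogram(*args, **kwargs)

    if not kwargs.get('no_plot', False):
        for i, d in zip(ddata['icoord'], ddata['dcoord']):
            # Midpoint of the horizontal segment of the inverted U.
            x = 0.5 * sum(i[1:3])
            # Height of the horizontal segment, i.e. the merge distance.
            y = d[1]
            plt.plot(x, y, 'ro')
            plt.annotate("%.3g" % y, (x, y), xytext=(0, -8),
                         textcoords='offset points',
                         va='top', ha='center')

    return ddata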
For your mat array, the augmented dendrogram shows that points 'a' and 'c' are 1.01 units apart, and point 'b' is 1.57 units from the cluster ['a', 'c'].
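The same numbers can be read from the dendrogram data without drawing anything. The following sketch passes no_plot=True and prints the icoord and dcoord entries for mat; the name heights is introduced here for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

Z = linkage(mat, "single")

# no_plot=True returns the layout data without creating a figure.
ddata = dendrogram(Z, labels=["a", "b", "c"], no_plot=True)

# Each merge is one inverted U described by four (x, y) points:
# left foot, left top, right top, right foot.
for i, d in zip(ddata["icoord"], ddata["dcoord"]):
    print("x:", i, "y:", d)

# The two middle y values are the height of the horizontal segment,
# which is the distance at which the two clusters were merged.
heights = sorted(d[1] for d in ddata["dcoord"])
print(["%.3g" % h for h in heights])
```

The heights match the third column of the linkage matrix, which is where dendrogram takes them from.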
It seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown?
The flag show_leaf_counts only applies when not all the original data points are shown as leaves. For example, when truncate_mode="lastp", only the last p nodes are shown.
Here's an example with 100 points:
import numpy as np
from scipy.cluster.hierarchy import linkage
import matplotlib.pyplot as plt
from augmented_dendrogram import augmented_dendrogram

# Generate a random sample of `n` points in 2-d.
# Note: this covariance matrix is not positive semidefinite
# (its determinant is negative), so NumPy may emit a warning.
np.random.seed(12312)
n = 100
x = np.random.multivariate_normal([0, 0],
                                  np.array([[4.0, 2.5], [2.5, 1.4]]),
                                  size=(n,))

# Figure 1: scatter plot of the sample.
plt.figure(1, figsize=(6, 5))
plt.clf()
plt.scatter(x[:, 0], x[:, 1])
plt.axis('equal')
plt.grid(True)

linkage_matrix = linkage(x, "single")

# Figure 2: truncated dendrograms without and with leaf counts.
plt.figure(2, figsize=(10, 4))
plt.clf()

plt.subplot(1, 2, 1)
show_leaf_counts = False
ddata = augmented_dendrogram(linkage_matrix,
                             color_threshold=1,
                             p=6,
                             truncate_mode='lastp',
                             show_leaf_counts=show_leaf_counts,
                             )
plt.title("show_leaf_counts = %s" % show_leaf_counts)

plt.subplot(1, 2, 2)
show_leaf_counts = True
ddata = augmented_dendrogram(linkage_matrix,
                             color_threshold=1,
                             p=6,
                             truncate_mode='lastp',
                             show_leaf_counts=show_leaf_counts,
                             )
plt.title("show_leaf_counts = %s" % show_leaf_counts)

plt.show()
Figure 1 (the scatter plot) shows the points in the data set.
With p=6 and truncate_mode="lastp", dendrogram shows only the "top" of the tree. Figure 2 shows the effect of show_leaf_counts: the left panel is drawn with show_leaf_counts=False and the right panel with show_leaf_counts=True.
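The effect of show_leaf_counts can also be checked without plotting, by looking at the leaf labels that dendrogram returns under the key ivl. The sketch below uses its own hypothetical sample of 100 points (not the sample from the example above), and the helpers leaf_labels and leaf_size are illustrative names, not part of SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: 100 standard-normal points in 2-d.
rng = np.random.default_rng(0)
points = rng.standard_normal((100, 2))
Z = linkage(points, "single")


def leaf_labels(show_leaf_counts):
    """Return the leaf labels of the truncated dendrogram (no figure)."""
    ddata = dendrogram(Z, p=6, truncate_mode="lastp",
                       show_leaf_counts=show_leaf_counts, no_plot=True)
    return ddata["ivl"]


def leaf_size(label):
    """A label like '(23)' is a contracted cluster; anything else is one point."""
    return int(label[1:-1]) if label.startswith("(") else 1


with_counts = leaf_labels(True)
without_counts = leaf_labels(False)
print(with_counts)
print(without_counts)

# The counts in parentheses plus the single points cover the whole sample.
total = sum(leaf_size(label) for label in with_counts)
print(total)
```

With show_leaf_counts=True, each contracted cluster is labelled with its size in parentheses; with show_leaf_counts=False, those leaves get an empty label.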