How to plot and annotate hierarchical clustering dendrograms in scipy/matplotlib
Question
I'm using dendrogram from scipy to plot hierarchical clustering using matplotlib, as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Left panel: cluster using mat directly.
plt.subplot(1, 2, 1)
plt.title("mat")
dist_mat = mat
linkage_matrix = linkage(dist_mat, "single")
print("linkage2:")
print(linkage(1 - dist_mat, "single"))
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)

# Right panel: cluster using 1 - mat.
plt.subplot(1, 2, 2)
plt.title("1 - mat")
dist_mat = 1 - mat
linkage_matrix = linkage(dist_mat, "single")
dendrogram(linkage_matrix,
           color_threshold=1,
           labels=["a", "b", "c"],
           show_leaf_counts=True)
plt.show()
My questions are: first, why do mat and 1-mat give identical clusterings here? Second, how can I annotate the distance along each branch of the tree using dendrogram so that the distances between pairs of nodes can be compared?
Finally, it seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown? Thanks.
Recommended answer
The input to linkage() is either an n x m array, representing n points in m-dimensional space, or a one-dimensional array containing the condensed distance matrix. In your example, mat is 3 x 3, so you are clustering three 3-d points. Clustering is based on the distance between these points.
Why do mat and 1-mat give identical clusterings here?
The arrays mat and 1-mat produce the same clustering because the clustering is based on distances between the points, and neither a reflection (-mat) nor a translation (mat + offset) of the entire data set changes the relative distances between the points.
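The claim above can be checked numerically. The following is a minimal sketch, not part of the original answer: it compares the linkage matrices for mat and 1-mat, and then, assuming that 1 - mat was intended as a distance matrix (an assumption about the asker's intent), converts it to condensed form with squareform before clustering. The variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Rows are treated as three points in 3-d. Reflecting and translating the
# points leaves every pairwise Euclidean distance unchanged.
Z_points = linkage(mat, "single")
# SciPy may warn here, because 1 - mat looks like a square distance matrix.
Z_reflected = linkage(1 - mat, "single")
same_tree = np.allclose(Z_points, Z_reflected)
print(same_tree)

# Assumption: 1 - mat is meant to be a distance matrix. squareform()
# converts the square form into the condensed 1-d form linkage() accepts.
condensed = squareform(1 - mat)
Z_dist = linkage(condensed, "single")
print(Z_dist[:, 2])  # merge heights when 1 - mat is used as distances
```

With the condensed input, the merge heights come from the entries of 1 - mat themselves rather than from Euclidean distances between its rows.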
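How can I annotate the distance along each branch of the tree using dendrogram so that the distances between pairs of nodes can be compared?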
In the code below, I show how you can use the data returned by dendrogram to label the horizontal segments of the diagram with the corresponding distance. The values associated with the keys icoord and dcoord give the x and y coordinates of each three-segment inverted U of the figure. In augmented_dendrogram, this data is used to add a label showing the distance (i.e. the y value) of each horizontal line segment in the dendrogram.
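from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt


def augmented_dendrogram(*args, **kwargs):
    # Draw the dendrogram as usual and keep the returned plot data.
    ddata = dendrogram(*args, **kwargs)

    if not kwargs.get('no_plot', False):
        # Each (icoord, dcoord) pair holds the x and y coordinates of the
        # four corners of one inverted U.
        for i, d in zip(ddata['icoord'], ddata['dcoord']):
            x = 0.5 * sum(i[1:3])  # midpoint of the horizontal segment
            y = d[1]               # height of the segment (merge distance)
            plt.plot(x, y, 'ro')
            plt.annotate("%.3g" % y, (x, y), xytext=(0, -8),
                         textcoords='offset points',
                         va='top', ha='center')

    return ddata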
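To see what icoord and dcoord hold without drawing anything, the data can be inspected with no_plot=True. This is a small illustrative sketch using the question's mat array; the variable names are my own, and the order in which dendrogram lists the inverted U shapes is not relied upon.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])
Z = linkage(mat, "single")

# no_plot=True returns the plot data without needing a figure.
ddata = dendrogram(Z, labels=["a", "b", "c"], no_plot=True)

for xs, ys in zip(ddata["icoord"], ddata["dcoord"]):
    # xs and ys each hold four values: the corners of one inverted U.
    # ys[1] and ys[2] are the height of the horizontal segment.
    print(xs, ys)

# The heights of the horizontal segments are the merge distances in Z.
segment_heights = sorted(ys[1] for ys in ddata["dcoord"])
print(segment_heights)
```

The middle two y values of each entry are equal, which is why augmented_dendrogram can read the label from d[1] alone.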
For your mat array, the augmented dendrogram shows that points 'a' and 'c' are 1.01 units apart, and point 'b' is 1.57 units from the cluster ['a', 'c'].
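These two numbers can be reproduced directly from the Euclidean distances between the rows of mat. The sketch below is an illustration rather than part of the original answer; it uses pdist with its default Euclidean metric, which is also what linkage uses by default for an n x m observation array.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

mat = np.array([[1, 0.5, 0.9],
                [0.5, 1, -0.5],
                [0.9, -0.5, 1]])

# Full 3 x 3 table of Euclidean distances between the rows a, b, c.
pairwise = squareform(pdist(mat))
print(np.round(pairwise, 2))

# Single linkage merges the closest pair first (a and c), then joins b
# at the smaller of its distances to a and to c.
merge_heights = linkage(mat, "single")[:, 2]
print(np.round(merge_heights, 2))
```

Here the a-c distance is sqrt(1.02) and the a-b distance is sqrt(2.46), which round to the 1.01 and 1.57 shown on the dendrogram.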
It seems that the show_leaf_counts flag is ignored. Is there a way to turn it on so that the number of objects in each class is shown?
The flag show_leaf_counts only applies when not all the original data points are shown as leaves. For example, when truncate_mode="lastp", only the last p nodes are shown.
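The effect of the flag can also be seen in the leaf labels that dendrogram returns under the 'ivl' key, without plotting. This is a hedged sketch with made-up random data (the seed, the point cloud and the helper name leaf_labels are all illustrative): contracted leaves carry a count in parentheses when show_leaf_counts is True and an empty label when it is False.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Illustrative data: 100 random points in 2-d.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))
Z = linkage(points, "single")


def leaf_labels(show_leaf_counts):
    # Truncate the tree with lastp and return only the leaf labels.
    ddata = dendrogram(Z, p=6, truncate_mode="lastp",
                       show_leaf_counts=show_leaf_counts, no_plot=True)
    return ddata["ivl"]


with_counts = leaf_labels(True)
without_counts = leaf_labels(False)
print(with_counts)     # contracted leaves look like "(k)"
print(without_counts)  # contracted leaves have empty labels
```

Leaves that are single original points keep their index as the label in both cases, so the counts in parentheses plus the singleton leaves account for all 100 points.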
Here's an example with 100 points:
import numpy as np
from scipy.cluster.hierarchy import linkage
import matplotlib.pyplot as plt
# Assumes the function above is saved as augmented_dendrogram.py.
from augmented_dendrogram import augmented_dendrogram

# Generate a random sample of `n` points in 2-d.
# Note: this covariance matrix has a negative determinant, so it is not
# positive semidefinite and NumPy may emit a RuntimeWarning here.
np.random.seed(12312)
n = 100
x = np.random.multivariate_normal([0, 0], np.array([[4.0, 2.5], [2.5, 1.4]]),
                                  size=(n,))

# Figure 1: scatter plot of the sample.
plt.figure(1, figsize=(6, 5))
plt.clf()
plt.scatter(x[:, 0], x[:, 1])
plt.axis('equal')
plt.grid(True)

linkage_matrix = linkage(x, "single")

# Figure 2: truncated dendrograms without and with leaf counts.
plt.figure(2, figsize=(10, 4))
plt.clf()

plt.subplot(1, 2, 1)
show_leaf_counts = False
ddata = augmented_dendrogram(linkage_matrix,
                             color_threshold=1,
                             p=6,
                             truncate_mode='lastp',
                             show_leaf_counts=show_leaf_counts,
                             )
plt.title("show_leaf_counts = %s" % show_leaf_counts)

plt.subplot(1, 2, 2)
show_leaf_counts = True
ddata = augmented_dendrogram(linkage_matrix,
                             color_threshold=1,
                             p=6,
                             truncate_mode='lastp',
                             show_leaf_counts=show_leaf_counts,
                             )
plt.title("show_leaf_counts = %s" % show_leaf_counts)

plt.show()
The first figure is a scatter plot of the points in the data set.
With p=6 and truncate_mode="lastp", dendrogram only shows the "top" of the dendrogram. The second figure shows the effect of show_leaf_counts.