Node indexing in hierarchical clustering dendrograms


Question


I'm looking to annotate a hierarchical clustering dendrogram, but I have some trouble associating the node indices produced by scipy.cluster.hierarchy.dendrogram when plotting, to the node indices in the original linkage matrix (e.g. produced with scipy.cluster.hierarchy.linkage).

For instance, say we have the following example (adapted from this SO question),

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
%matplotlib inline

# generate two clusters: a with 10 points, b with 5:
np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[5,])
X = np.concatenate((a, b),)
Z = linkage(X, 'ward')
# make distances between pairs of children uniform 
# (re-scales the horizontal (distance) axis when plotting)
Z[:,2] = np.arange(Z.shape[0])+1 

def plot_dendrogram(linkage_matrix, **kwargs):

    ddata = dendrogram(linkage_matrix, **kwargs)
    idx = 0
    for i, d, c in zip(ddata['icoord'], ddata['dcoord'], 
                       ddata['color_list']):       
        x = 0.5 * sum(i[1:3])
        y = d[1]
        plt.plot(y, x, 'o', c=c)
        plt.annotate("%.3g" % idx, (y, x), xytext=(15, 5),
                             textcoords='offset points',
                             va='top', ha='center')
        idx += 1
plot_dendrogram(Z, labels=np.arange(X.shape[0]),
                truncate_mode='level', show_leaf_counts=False,
                orientation='left')

which produces the following dendrogram:

The original X matrix has (X.shape[0] == 15) samples, and the tick labels on the vertical axis correspond to the sample id of each tree leaf. The number at each node is the id of that node as returned by the dendrogram function. Now if we look at the original linkage matrix, the first two columns give the children of each tree node,

print(Z[:,:2].astype('int'))
  [[ 0  3]
   [ 1  8]
   [ 6 16]
   [ 2  5]
   ...
   [22 24]
   [23 25]
   [26 27]]

For instance, node 0 in the linkage matrix has the leaves [0, 3] as children, but on the dendrogram above it is labeled as number 9. Similarly, node 1 is labeled as number 4, and so on.

I was wondering what would be the simplest way of finding the correspondence between these 2 indices? I looked at the dendrogram function but didn't see any simple way of doing that, particularly if we truncate the dendrogram to some level (e.g. with truncate_mode='level', p=2).

Note: I'm actually using a linkage matrix given by sklearn.cluster.AgglomerativeClustering, but that doesn't really matter for this question (as illustrated in this github issue).

Note 2: alternatively, if there is a way to compute the list of leaves for every dendrogram node, that would also solve my problem.
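(As a pointer on this last note: scipy's own `to_tree` can compute the leaf list for every node. A sketch on the same data as in the snippet above; `leaves_per_node` is my name, not scipy API:)

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# same example data as above
np.random.seed(1)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[5,])
X = np.concatenate((a, b),)
Z = linkage(X, 'ward')

# rd=True returns the root plus a list of every ClusterNode;
# nodes 0..14 are the leaves, nodes 15..28 map to linkage rows 0..13
root, nodes = to_tree(Z, rd=True)
leaves_per_node = {node.id: node.pre_order() for node in nodes}
```

Here `node.pre_order()` returns the original sample ids under that node, so `leaves_per_node[root.id]` covers all 15 samples.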

Solution

What I can deduce from the posted material are things I expect you've already noticed:

  1. The linkage matrix nodes are numbered in order of the clustering. The original nodes are 0-14. [0 3] becomes node 15, [1 8] is node 16, [6 [1 8]] is node 17, etc.
  2. The dendrogram nodes are numbered from the root, depth-first, upper (right) branch before the lower (left), with labels N-1 down to 0.
  3. Right and left are determined by the columns in the linkage matrix.

The tree behind the linkage matrix has 2N+1 nodes in total: N+1 original leaves plus N merges, one per matrix row. To reconstruct the dendrogram labels, start at the last row of the matrix, [26 27]. This gets label N-1, or 13. Recurse on the right node (column 1), then on the left (column 0). Decrement the label value each time you assign it.
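A minimal sketch of that reconstruction, assuming `Z` is the scipy linkage matrix from the question (the function name and the returned dict are mine, not scipy API):

```python
import numpy as np

def dendrogram_labels(Z):
    """Map each linkage-matrix row index to the dendrogram node number,
    assigned root-first, right branch before left, N-1 down to 0."""
    n_merges = Z.shape[0]           # N merges; the leaves are ids 0..N
    n_leaves = n_merges + 1
    labels = {}
    label = n_merges - 1            # the root gets label N-1

    def visit(cluster_id):
        nonlocal label
        if cluster_id < n_leaves:   # an original sample, nothing to label
            return
        row = cluster_id - n_leaves
        labels[row] = label
        label -= 1
        visit(int(Z[row, 1]))       # right child (column 1) first ...
        visit(int(Z[row, 0]))       # ... then the left child (column 0)

    visit(n_merges + n_leaves - 1)  # the root is the highest cluster id
    return labels
```

On the 15-sample example this maps row 13 (the root, [26 27]) to label 13, with every row receiving a distinct label in 0..13.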

Is that clear enough to follow?


Recursion depth is a problem for you? For a reasonably balanced tree, the depth shouldn't be much more than log2(N) for N nodes, or about 22-25 for a million nodes. Your run-time stack can't handle that? Mine defaults to one thousand.

In any case, it's not that hard to convert to iteration. Create a stack of unresolved nodes, starting with the root node.

stack = [root]
while stack is not empty:
    node = stack.pop()
    if node is not a leaf:
        stack.push(node.left)   # push left first ...
        stack.push(node.right)  # ... so the right branch is popped first
    process node  # assign node number, etc.

This makes it even easier to maintain the node counter, as it's now a local variable. Also note that this is a basic graph-search algorithm: "While we have nodes, grab the next one on the list, push its neighbors onto the list (but check to see whether each one is already handled), and process the current node."

For depth-first, use a stack; for breadth-first, use a queue.
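Applied to a scipy linkage matrix, where the two child columns stand in for `node.left`/`node.right` and "leaf" simply means an id below the number of samples, the iterative version might look like this sketch (again, the names are mine):

```python
import numpy as np

def dendrogram_labels_iter(Z):
    """Iterative depth-first relabeling of linkage rows: right branch
    before left, labels N-1 down to 0, no recursion needed."""
    n_merges = Z.shape[0]
    n_leaves = n_merges + 1
    labels = {}
    label = n_merges - 1               # the counter is now a plain local
    stack = [n_merges + n_leaves - 1]  # start from the root cluster id
    while stack:
        cluster_id = stack.pop()
        if cluster_id < n_leaves:      # an original sample: skip it
            continue
        row = cluster_id - n_leaves
        labels[row] = label
        label -= 1
        stack.append(int(Z[row, 0]))   # push left first ...
        stack.append(int(Z[row, 1]))   # ... so right is popped first
    return labels
```

Swapping the list for a `collections.deque` and popping from the left would turn the same loop into a breadth-first traversal.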
