Display cluster labels for a scipy dendrogram


Question

I'm using hierarchical clustering to cluster word vectors, and I want the user to be able to display a dendrogram showing the clusters. However, since there can be thousands of words, I want this dendrogram to be truncated to some reasonable size, with the label for each leaf being a string of the most significant words in that cluster.

My problem is that, according to the docs, "The labels[i] value is the text to put under the ith leaf node only if it corresponds to an original observation and not a non-singleton cluster." I take this to mean that I can't label clusters, only individual points?

To illustrate, here is a short Python script that generates a simple labeled dendrogram:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

randomMatrix = np.random.uniform(-10,10,size=(20,3))
linked = linkage(randomMatrix, 'ward')

labelList = ["foo" for i in range(0, 20)]

plt.figure(figsize=(15, 12))
dendrogram(
            linked,
            orientation='right',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=False
          )
plt.show()

Now let's say I want to truncate to just 5 leaves and label each leaf like "foo, foo, foo...", i.e. with the words that make up that cluster. (Note: generating these labels is not the issue here.) I truncate the dendrogram and supply a label list to match:

labelList = ["foo, foo, foo..." for i in range(0, 5)]
dendrogram(
            linked,
            orientation='right',
            p=5,
            truncate_mode='lastp',
            labels=labelList,
            distance_sort='descending',
            show_leaf_counts=False
          )

And here's the problem: the resulting plot has no labels.

I'm thinking there might be a use here for the parameter 'leaf_label_func', but I'm not sure how to use it.

Recommended answer

You are correct about using the leaf_label_func parameter.

In addition to creating a plot, the dendrogram function returns a dictionary (called R in the docs) containing several lists. The leaf_label_func you create must take a value from R["leaves"] and return the desired label. The easiest way to set labels is to run dendrogram twice: once with no_plot=True to get the dictionary used to build your label map, and then again to create the plot.

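import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

randomMatrix = np.random.uniform(-10,10,size=(20,3))
linked = linkage(randomMatrix, 'ward')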

labels = ["A", "B", "C", "D"]
p = len(labels)

plt.figure(figsize=(8,4))
plt.title('Hierarchical Clustering Dendrogram (truncated)', fontsize=20)
plt.xlabel('Look at my fancy labels!', fontsize=16)
plt.ylabel('distance', fontsize=16)

# call dendrogram to get the returned dictionary 
# (plotting parameters can be ignored at this point)
R = dendrogram(
                linked,
                truncate_mode='lastp',  # show only the last p merged clusters
                p=p,  # number of leaves to keep (one per label)
                no_plot=True,
                )

print("values passed to leaf_label_func\nleaves : ", R["leaves"])

# create a label dictionary
temp = {R["leaves"][ii]: labels[ii] for ii in range(len(R["leaves"]))}
def llf(xx):
    return "{} - custom label!".format(temp[xx])

## This version gives you your label AND the count
# temp = {R["leaves"][ii]:(labels[ii], R["ivl"][ii]) for ii in range(len(R["leaves"]))}
# def llf(xx):
#     return "{} - {}".format(*temp[xx])


dendrogram(
            linked,
            truncate_mode='lastp',  # show only the last p merged clusters
            p=p,  # must match the p used in the no_plot call above
            leaf_label_func=llf,
            leaf_rotation=60.,
            leaf_font_size=12.,
            show_contracted=True,  # to get a distribution impression in truncated branches
            )
plt.show()
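
The example above assigns the labels by position. If each leaf label should instead list the words that actually belong to that cluster, as the question asks, one possible approach is to look up the members of each id in R["leaves"] with scipy's to_tree. This is only a sketch under assumptions: the `words` list, the `members` helper and the `max_words` cutoff are made up for illustration and are not part of the original answer. It calls dendrogram with no_plot=True both times so that it runs without a display; drop no_plot from the second call to draw the figure.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, to_tree

rng = np.random.default_rng(0)
points = rng.uniform(-10, 10, size=(20, 3))
# Hypothetical vocabulary: one word per row of `points`.
words = ["word{}".format(i) for i in range(len(points))]
linked = linkage(points, "ward")

# First pass: compute the truncated layout only, to learn which ids survive.
R = dendrogram(linked, truncate_mode="lastp", p=5, no_plot=True)

# With rd=True, to_tree also returns a list in which nodes[i] is the node
# with id i. Ids below len(points) are original observations; larger ids
# are merged (non-singleton) clusters.
_, nodes = to_tree(linked, rd=True)

def members(cluster_id):
    """Return the original observation indices contained in a cluster id."""
    return nodes[cluster_id].pre_order()

def llf(cluster_id, max_words=3):
    """Label a leaf with the first few words of its cluster."""
    names = [words[i] for i in members(cluster_id)]
    label = ", ".join(names[:max_words])
    return label + ", ..." if len(names) > max_words else label

# Second pass: same truncation, now with the label function attached.
labeled = dendrogram(
    linked,
    truncate_mode="lastp",
    p=5,
    leaf_label_func=llf,
    no_plot=True,  # remove this argument to actually draw the plot
)
print(labeled["ivl"])
```

Here pre_order returns the members in tree order, not in order of importance, so picking the "most significant" words would still need a ranking step of your own before joining them.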
