Cutting a dendrogram at the highest level of purity

Question

I am trying to write a program that clusters documents using hierarchical agglomerative clustering. The output of the program depends on cutting the dendrogram at the level that gives maximum purity.

This is the algorithm I am working with right now:

create dendrogram for the documents in the dataset
purity = 0
final_clusters = none
for each level lvl in the dendrogram
    clusters = cut dendrogram at lvl
    new_purity = calculate_purity_of(clusters)
    if new_purity > purity
        purity = new_purity
        final_clusters = clusters

The problem is that when I cut the dendrogram at the lowest level, every cluster contains only one document. Each cluster is then 100% pure, so the average purity of the clusters is 1.0. But this is not the desired output. What I want is a proper grouping of documents. Am I doing something wrong?
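The behaviour described above can be reproduced with a minimal sketch. Everything in it is an assumption made for illustration: the documents are stood in for by six 1-D points, the dendrogram is built with a naive single-linkage merge, purity is the common size-weighted definition (which may differ from the "average purity of clusters" mentioned above, although both give 1.0 for singleton clusters), and the names `purity`, `dendrogram_levels`, and `best_cut` are hypothetical.

```python
from collections import Counter

def purity(clusters, labels):
    """Size-weighted purity: share of documents carrying their cluster's majority label."""
    majority_total = sum(
        Counter(labels[i] for i in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return majority_total / len(labels)

def single_link_distance(points, left, right):
    """Smallest distance between any member of `left` and any member of `right`."""
    return min(abs(points[p] - points[q]) for p in left for q in right)

def dendrogram_levels(points):
    """Naive single-linkage agglomeration; yields the clusters at every level."""
    clusters = [[i] for i in range(len(points))]
    yield [list(c) for c in clusters]  # lowest level: one document per cluster
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        a, b = min(
            pairs,
            key=lambda ij: single_link_distance(points, clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
        yield [list(c) for c in clusters]

def best_cut(points, labels):
    """The loop from the question: keep the cut with the highest purity."""
    best_purity, best_clusters = 0.0, None
    for clusters in dendrogram_levels(points):
        new_purity = purity(clusters, labels)
        if new_purity > best_purity:
            best_purity, best_clusters = new_purity, clusters
    return best_purity, best_clusters

# Toy data (hypothetical): positions stand in for document features.
points = [0, 1, 3, 10, 13, 50]
labels = ["x", "x", "y", "y", "y", "x"]

for clusters in dendrogram_levels(points):
    print(len(clusters), "clusters, purity", round(purity(clusters, labels), 2))
print(best_cut(points, labels))
```

On this toy data purity falls from 1.0 at six singleton clusters to 0.5 at a single cluster, so the loop selects the all-singletons cut, which is exactly the unwanted result.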

Recommended answer

The measure you are using is too simple.

Yes, the "optimal" solution with respect to purity is to only merge duplicate objects, so that each cluster remains pure by definition.
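A short derivation may make this concrete. It assumes the usual size-weighted definition of purity, which the question does not spell out; the symbols $\omega_k$ (clusters), $c_j$ (true classes), and $N$ (number of documents) are introduced here for illustration.

```latex
\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert

% Splitting a cluster \omega into disjoint parts \omega' and \omega'' never lowers its term:
\max_{j} \lvert \omega \cap c_j \rvert
  \le \max_{j} \lvert \omega' \cap c_j \rvert + \max_{j} \lvert \omega'' \cap c_j \rvert

% With one document per cluster, every term equals 1:
\mathrm{purity} = \frac{1}{N} \sum_{k=1}^{N} 1 = 1
```

Under this definition, every cut lower in the dendrogram has purity at least as high as the cut above it, so a search for maximum purity will always end at the finest cut that keeps clusters pure.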

This is why optimizing a mathematical criterion often isn't the right approach to tackle a real data problem. Instead, you need to ask yourself the question: "what would be an interesting result", where interesting is not the same as optimal in a mathematical sense.

Sorry that I cannot give you a better answer - but I don't have your data.

IMHO, any abstract mathematical approach will suffer from the same fate. You need to have your data and user needs specify what to cluster, not some statistical number; so don't look in mathematics for the answer, but look at your data and your user needs.
