What does the Brown clustering algorithm output mean?
Question
I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. Both of them output a binary string and an integer for each unique token. For example:
0 the 6
10 chased 3
110 dog 2
1110 mouse 2
1111 cat 2
What do the binary string and the integer mean?
From the first link, the binary string is known as a bit string; see http://saffron.deri.ie/acl_acl/document/ACL_ANTHOLOGY_ACL_P11-1053/
But how do I tell from the output that "dog", "mouse", and "cat" are in one cluster, and that "the" and "chased" are not in that cluster?
Recommended answer
If I understand correctly, the algorithm gives you a tree, and you need to truncate it at some level to get clusters. With these bit strings, that means you just take the first L characters of each one.
For example, cutting at the second character gives you two clusters ("the", whose bit string 0 is shorter than the cut, stays in a cluster of its own):
10 chased
11 dog
11 mouse
11 cat
At the third character you get:
110 dog
111 mouse
111 cat
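The prefix cut described above can be sketched in a few lines of Python. This is only an illustration, not part of either linked implementation: the function name `clusters_at_depth` and the `SAMPLE` string are made up for this example, and it assumes each output line has exactly three whitespace-separated fields (bit string, token, integer). The integer is read but not used; the brown-cluster README describes it as the number of times the word occurs in the input.

```python
from collections import defaultdict

# Sample output copied from the question: bit string, token, integer.
SAMPLE = """\
0 the 6
10 chased 3
110 dog 2
1110 mouse 2
1111 cat 2
"""


def clusters_at_depth(lines, depth):
    """Group tokens by the first `depth` characters of their bit string.

    Hypothetical helper: bit strings shorter than `depth` are kept whole,
    so such tokens end up in a cluster of their own.
    """
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        bits, token, _count = parts  # the integer is ignored here
        clusters[bits[:depth]].append(token)
    return dict(clusters)


for depth in (2, 3):
    print(depth, clusters_at_depth(SAMPLE.splitlines(), depth))
```

With the sample above, a depth of 2 puts "dog", "mouse", and "cat" together under the prefix 11, while "the" (0) and "chased" (10) each sit under a different prefix.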
The cutting strategy is a different subject, though.