What does the Brown clustering algorithm output mean?

Question

I ran the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. Both give some sort of binary string and an integer for each unique token. For example:

0        the        6
10        chased        3
110        dog        2
1110        mouse        2
1111        cat        2

What do the binary string and the integer mean?

According to the first link, the binary string is known as a bit-string; see http://saffron.deri.ie/acl_acl/document/ACL_ANTHOLOGY_ACL_P11-1053/

But how do I tell from the output that dog, mouse, and cat form one cluster, and that the and chased are not in that cluster?

Recommended answer

If I understand correctly, the algorithm gives you a tree, and you need to truncate it at some level to get clusters. With those bit strings, you just take the first L characters of each one.

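For example, cutting at the second character separates chased from the three animal words (the, whose bit string 0 is shorter than the cut, stays in a cluster of its own):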

10           chased     

11           dog        
11           mouse      
11           cat        

Cutting at the third character splits the 11 cluster further:

110           dog        

111           mouse      
111           cat        

The cutting strategy is a different subject though.
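
The prefix truncation described above can be sketched in a few lines of Python. This is only an illustration, not code from either repository: the function name `clusters_at_depth` and the `SAMPLE` string are made up for this example, and the third column is ignored here (the brown-cluster README appears to describe it as the number of times the word occurs in the input, but treat that as an assumption).

```python
from collections import defaultdict

# Sample output copied from the question: bit string, token, integer.
SAMPLE = """\
0 the 6
10 chased 3
110 dog 2
1110 mouse 2
1111 cat 2
"""


def clusters_at_depth(lines, depth):
    """Group tokens by the first `depth` characters of their bit string.

    Hypothetical helper for illustration; the third column is not used.
    """
    groups = defaultdict(list)
    for line in lines:
        parts = line.split()  # works for tab- or space-separated columns
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        bits, token = parts[0], parts[1]
        # Slicing is safe on short strings: "0"[:2] is still "0".
        groups[bits[:depth]].append(token)
    return dict(groups)


for depth in (2, 3):
    print(depth, clusters_at_depth(SAMPLE.splitlines(), depth))
```

With the sample data, a depth of 2 puts dog, mouse, and cat together under the prefix 11, while a depth of 3 leaves only mouse and cat together under 111. Tokens whose bit string is shorter than the cut, such as the, simply keep their full string as the cluster key.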
