Better text document clustering than tf/idf and cosine similarity?

Question

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried clustering the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found the results to be quite bad.

The main disadvantage of using tf/idf is that it clusters documents that are keyword-similar, so it's only good for identifying near-identical documents. For example, consider the following sentences:

1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.

The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:

1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.

Now, using tf/idf, the clustering algorithm will fail miserably because the two sentences share only one keyword, even though they both talk about the same topic.
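The gap between the two pairs can be reproduced with a minimal standard-library sketch. The tokenizer, the smoothed idf formula, and the helper names `tokenize`, `tfidf_vectors` and `cosine` are assumptions made for illustration, not the online clustering algorithm referred to above. No stop words are removed, so "is" and "a" count as keywords.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and strip basic punctuation; no stop-word removal."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf (an assumed variant; other weightings exist).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # sentences sharing four terms
sim_02 = cosine(vecs[0], vecs[2])  # sentences sharing only "stackoverflow"
print(f"nice place vs. is a website:    {sim_01:.3f}")
print(f"nice place vs. visit regularly: {sim_02:.3f}")
```

In this toy setup the first pair scores far higher than the second, so a threshold loose enough to merge the second pair would also merge many tweets that are only loosely related.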

My question: are there better techniques to cluster documents?

Recommended answer

In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't share enough terms.
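As a rough illustration of the idea, here is a standard-library sketch of LSA on a toy corpus: build a tf-idf matrix, keep the top `k` singular directions, and compare documents by cosine similarity in that reduced space. Everything in it is an assumption made for the example: the three "Radiohead" sentences that supply a second topic, the tiny stop-word list, the helper names, and the use of power iteration on the document Gram matrix in place of a library SVD. In practice you would more likely use a library's truncated SVD (for instance scikit-learn's `TruncatedSVD`) on a much larger corpus with a larger `k`.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "i"}  # tiny illustrative list (assumption)


def tokenize(text):
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Dense documents x terms tf-idf matrix as a list of rows."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[t] * idf[t] for t in vocab])
    return rows


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_eigenpairs(sym, k, iters=1000):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(sym)
    m = [row[:] for row in sym]
    pairs = []
    for c in range(k):
        # A different deterministic start vector for each component.
        v = [1.0 / (i + 1 + c) for i in range(n)]
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


def lsa_embed(rows, k):
    """Rank-k document embeddings U_k * Sigma_k from the Gram matrix A A^T."""
    n = len(rows)
    gram = [[sum(x * y for x, y in zip(rows[i], rows[j])) for j in range(n)]
            for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
            for i in range(n)]


docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
    "The band Radiohead is a great act.",   # hypothetical second topic
    "Radiohead is a band.",
    "I hear Radiohead daily.",
]
rows = tfidf_matrix(docs)
emb = lsa_embed(rows, k=2)
raw_sim = cosine(rows[0], rows[2])    # raw tf-idf: one shared term
lsa_sim = cosine(emb[0], emb[2])      # same pair in the 2-d LSA space
cross_sim = cosine(emb[0], emb[3])    # different topics in the LSA space
print(f"raw tf-idf: {raw_sim:.3f}  LSA: {lsa_sim:.3f}  cross-topic: {cross_sim:.3f}")
```

On this deliberately clean corpus the two topics use disjoint vocabularies, so with `k=2` each topic collapses onto its own direction and the "nice place" and "visit regularly" sentences end up nearly parallel despite sharing a single term. Real tweets are far noisier, so treat this only as a picture of why the reduced space helps with sparsity.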

Topic models such as LDA might work even better.
