Better text document clustering than tf/idf and cosine similarity?

Views: 112 Posted: 2020/5/4 9:26:28 Tags: machine-learning data-mining cluster-analysis text-mining


Problem description

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are quite bad.

The main disadvantage of using tf/idf is that it clusters documents that share keywords, so it is only good at identifying near-identical documents. For example, consider the following sentences:

1- The website Stackoverflow is a nice place. 2- Stackoverflow is a website.

The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:

1- The website Stackoverflow is a nice place. 2- I visit Stackoverflow regularly.

Now, by using tf/idf, the clustering algorithm will fail miserably, because the two sentences share only one keyword even though they both talk about the same topic.
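To make the two cases concrete, here is a minimal pure-Python sketch that scores both sentence pairs with tf/idf and cosine similarity. It is not the online clustering algorithm used on the stream, which is not described here. The tokenizer, the smoothed idf formula, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are assumptions chosen for illustration, and the idf is computed over just these three sentences.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed smoothed idf variant: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; only shared terms add to the dot product."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
vecs = tfidf_vectors(docs)

# Pair from the first example: many shared keywords.
sim_keyword_pair = cosine(vecs[0], vecs[1])
# Pair from the second example: same topic, one shared keyword.
sim_topic_pair = cosine(vecs[0], vecs[2])

print(f"keyword-similar pair: {sim_keyword_pair:.3f}")
print(f"same-topic pair:      {sim_topic_pair:.3f}")
```

Because the dot product only sums over terms that appear in both sentences, the second pair scores far lower than the first. Any threshold loose enough to merge the second pair would also merge many unrelated tweets that happen to share a single word.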

My question: are there better techniques for clustering documents?
