What is a convenient way to do document clustering with elasticsearch?


Question


I have stored a lot of news articles from the RSS feeds of different sources in an elasticsearch index. At the moment, when I run a search query, it returns a lot of similar news articles, because the same news topics get covered by many RSS sources.

Instead, what I would like to do is return only one news article out of a group of articles on the same topic. So I somehow need to recognize which articles are about the same topic, cluster these documents, and return only the "best" article out of each cluster.

What would be the most convenient way to approach that problem? Can I somehow make use of the elasticsearch more-like-this API? Or is the https://github.com/carrot2/elasticsearch-carrot2 plugin the way to go? Or is there simply no convenient way, so that I have to implement my own version of http://en.wikipedia.org/wiki/K-means_clustering or http://en.wikipedia.org/wiki/Non-negative_matrix_factorization to cluster my documents?

Solution

  1. ES is not particularly useful for clustering. Most clustering algorithms require pairwise distance computations, which is easiest if you can fit all your data into a huge matrix (and then factor it). So it may well be easier (and faster) to work outside ES!

  2. None of these approaches works half as well as advertised. See e.g. "Reading tea leaves" (Chang et al., 2009). Everybody who constructs such an algorithm is happy to get anything out, and will tune and fiddle with parameters and rerun until the result looks nice. The technical term is cherry picking. Evaluation is incredibly sloppy, and if you look at the results closely, they aren't any better than choosing a random keyword (say, "car") and doing a text search on that, which is much more meaningful than those "topics" discovered by topic models that nobody can decipher in practice. So good luck...

Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems (pp. 288-296).
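To make point 1 concrete, here is a minimal sketch of doing the pairwise comparison outside ES, assuming you first pull the top hits for a query into memory. It is not k-means or NMF, just a simple threshold-based grouping on TF-IDF cosine similarity, and it keeps the highest-scoring hit of each group. The function names, the `(score, text)` hit format (a stand-in for the ES `_score` and an article field), and the `0.5` threshold are all assumptions chosen for illustration, not anything ES provides.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so terms that occur everywhere keep a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.5):
    """Group documents whose pairwise similarity reaches the threshold.

    Uses union-find, so groups are connected components (single-link).
    The threshold is an arbitrary assumption and needs tuning on real data.
    """
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Quadratic in the number of documents: fine for a page of hits,
    # not for a whole index.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def best_per_cluster(hits, threshold=0.5):
    """hits: list of (score, text). Keep the highest-scoring hit per group."""
    groups = cluster([text for _, text in hits], threshold)
    return [max((hits[i] for i in members), key=lambda h: h[0]) for members in groups]


if __name__ == "__main__":
    hits = [
        (1.2, "Central bank raises interest rates again"),
        (2.5, "Central bank raises interest rates to fight inflation"),
        (0.9, "Local team wins football championship final"),
    ]
    for score, text in best_per_cluster(hits):
        print(score, text)
```

Because the comparison is all-pairs, this only makes sense on a bounded result set (say, the first few hundred hits), and in line with point 2 the grouping should be checked by eye before trusting it.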
