What is the easiest way to implement terms association mining in Solr?


Question

Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a docs × terms co-occurrence matrix and find the terms that occur in the same documents most often. In my previous projects I implemented this directly in Lucene by iterating over TermDocs (obtained by calling IndexReader.termDocs(Term)). But I can't see anything similar in Solr.

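So, my needs are:

  1. To retrieve the most associated terms within a particular field.
  2. To retrieve the terms that are closest to a specified term within a particular field.

I will rate answers in the following way:

  1. Ideally, I would like to find a Solr component that directly covers the specified needs, that is, something that returns associated terms directly.
  2. If this is not possible, I am looking for a way to get co-occurrence matrix information for a specified field.
  3. If this is not an option either, I would like to know the most straightforward way to 1) get all terms and 2) get the ids (numbers) of the documents where these terms occur.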

Recommended answer

Since there are still no answers to my question, I have to write down my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.

I'll go with the co-occurrence matrix, since it is the central part of association mining. In general, Solr provides all the functions needed to build this matrix in one way or another, though they are not as efficient as direct access through Lucene. To construct the matrix we need:

  1. All terms, or at least the most frequent ones, because rare terms by their nature won't affect the result of association mining.
  2. The documents where these terms occur, again, at least the top documents.

Both of these tasks can easily be done with standard Solr components.

To retrieve terms, TermsComponent or faceted search may be used. We can get only the top terms (the default) or all terms (by setting the maximum number of terms to return; see the documentation of the particular feature for details).
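As a rough illustration of the TermsComponent route, the sketch below builds a request URL and converts the response into a term-to-count mapping. It assumes a `/terms` request handler is configured on the core, that the JSON response uses the flat `[term, count, term, count, ...]` layout, and that the base URL and field name are placeholders; none of these come from the original answer, so adjust them to your setup.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=100):
    """Build a TermsComponent request URL for the most frequent terms of `field`.

    `base_url` (e.g. a core URL) and the `/terms` handler path are assumptions.
    """
    params = {
        "terms": "true",
        "terms.fl": field,       # field to read terms from
        "terms.limit": limit,    # a negative value asks for all terms
        "terms.sort": "count",   # most frequent terms first
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


def parse_terms(response, field):
    """Turn the flat [term, count, term, count, ...] list into a dict.

    Assumes the flat JSON layout; other `json.nl` settings need other parsing.
    """
    flat = response["terms"][field]
    return dict(zip(flat[0::2], flat[1::2]))
```

The URL can be fetched with any HTTP client and the decoded JSON body passed to `parse_terms`. Faceted search would work similarly with `facet.field` and `facet.limit`, at the cost of a different response layout.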

Getting the documents that contain a given term is simply a search for that term. The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides the number of occurrences of the term in each document found.
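The one-request-per-term step could look roughly like the following. This is only a sketch: it assumes the standard `/select` handler, a unique key field named `id`, and the `{!term}` query parser to match the raw indexed term. The base URL and field name are hypothetical.

```python
from urllib.parse import urlencode


def docs_for_term_url(base_url, field, term, rows=1000):
    """Build a search URL that returns the ids of documents containing `term`.

    Assumes the unique key field is called `id`; `rows` caps how many
    documents are returned per term.
    """
    params = {
        "q": f"{{!term f={field}}}{term}",  # match the indexed term as-is
        "fl": "id",                         # only the document id is needed
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def parse_doc_ids(response):
    """Collect the document ids from a decoded JSON search response."""
    return {doc["id"] for doc in response["response"]["docs"]}
```

Calling `docs_for_term_url` once per term from the term list yields a term-to-document-ids mapping, which is the raw material for the co-occurrence matrix.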

Having this, it is easy to build the co-occurrence matrix. To mine associations, it is possible to use other software such as Weka, or to write your own implementation of, say, the Apriori algorithm.
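Once a term-to-document-ids mapping is available, the co-occurrence counts reduce to set intersections. The sketch below is one simple way to do it, not Apriori and not what Weka does: it counts shared documents for every pair of terms and ranks the neighbours of a given term by that raw count. The input dictionary and function names are illustrative.

```python
from itertools import combinations


def cooccurrence(term_docs):
    """Count, for each pair of terms, the documents that contain both.

    `term_docs` maps each term to the set of ids of documents containing it.
    Pairs that never co-occur are omitted, which keeps the matrix sparse.
    """
    counts = {}
    for a, b in combinations(sorted(term_docs), 2):
        shared = len(term_docs[a] & term_docs[b])
        if shared:
            counts[(a, b)] = shared
    return counts


def closest_terms(term, term_docs, top=5):
    """Rank other terms by how many documents they share with `term`."""
    target = term_docs[term]
    ranked = sorted(
        ((len(target & docs), other)
         for other, docs in term_docs.items() if other != term),
        key=lambda pair: (-pair[0], pair[1]),  # highest count first, then name
    )
    return [(other, n) for n, other in ranked[:top] if n > 0]
```

Raw counts favour frequent terms, so in practice a normalised measure or a proper frequent-itemset algorithm may give better associations.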
