What is the easiest way to implement terms association mining in Solr?

Problem description

Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a docs × terms co-occurrence matrix and find the terms that most often occur in the same documents. In my previous projects I implemented this directly in Lucene by iterating over TermDocs (obtained by calling IndexReader.termDocs(Term), see http://lucene.apache.org/java/3_3_0/api/all/org/apache/lucene/index/IndexReader.html#termDocs%28org.apache.lucene.index.Term%29). But I can't see anything similar in Solr.

So, my needs are:

  1. To retrieve the most associated terms within a particular field.
  2. To retrieve the term that is closest to a specified one within a particular field.

I will rate answers in the following way:

  1. Ideally, I would like to find a Solr component that directly covers the specified needs, that is, something that returns associated terms directly.
  2. If this is not possible, I'm looking for a way to get co-occurrence matrix information for a specified field.
  3. If this is not an option either, I would like to know the most straightforward way to 1) get all terms and 2) get the ids (numbers) of the documents these terms occur in.

Solution

Since there are still no answers to my question, I have to write up my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.

I'll go with the co-occurrence matrix, since it is the principal part of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access with Lucene. To construct the matrix we need:

  1. All terms, or at least the most frequent ones, because rare terms by their nature won't affect the result of association mining.
  2. The documents where these terms occur, or again at least the top documents.

Both of these tasks can easily be done with standard Solr components.

To retrieve terms, TermsComponent or faceted search may be used. We can get only the top terms (the default) or all terms (by setting the maximum number of terms to return; see the documentation of the particular feature for details).
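
As a rough illustration of the TermsComponent route, here is a minimal Python sketch. It assumes a `/terms` request handler is available on the core and that the JSON response uses the default flat `[term, count, term, count, ...]` layout. The base URL, the core name and the field name `text` are hypothetical, and the sample response is made up for illustration.

```python
from urllib.parse import urlencode


def terms_request_url(base_url, field, limit=-1):
    """Build a TermsComponent request URL for one field.

    Assumption: a /terms handler is configured for the core.
    """
    params = {
        "terms": "true",
        "terms.fl": field,
        "terms.limit": limit,   # a value below 0 removes the cap on returned terms
        "terms.sort": "count",  # most frequent terms first
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


def parse_terms(response, field):
    """Convert the flat [term, count, term, count, ...] list into a dict."""
    flat = response["terms"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical response body, already decoded from JSON.
sample_response = {"terms": {"text": ["solr", 42, "lucene", 17, "index", 9]}}
term_counts = parse_terms(sample_response, "text")
```

The URL can be fetched with any HTTP client. Keeping the parsing separate from the request makes it easy to swap in faceted search, whose `facet_counts.facet_fields` section uses the same flat list layout by default.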

Getting the documents that contain a given term is simply a search for that term. The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides the number of occurrences of the term in each found document.
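
The per-term search could look roughly like the sketch below. It assumes the standard `/select` handler, a unique key field named `id`, and the `{!term}` query parser to match the indexed term without further analysis. The `rows` value and the sample response are hypothetical, and a large result set would need paging on top of this.

```python
from urllib.parse import urlencode


def docs_for_term_url(base_url, field, term, rows=1000, id_field="id"):
    """Build a /select URL that returns only the ids of documents containing `term`."""
    params = {
        "q": f"{{!term f={field}}}{term}",  # exact indexed-term match on one field
        "fl": id_field,                     # return the id field only
        "rows": rows,                       # assumption: enough to cover all matches
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def parse_doc_ids(response, id_field="id"):
    """Collect the ids from a decoded /select JSON response into a set."""
    return {doc[id_field] for doc in response["response"]["docs"]}


# Hypothetical response body, already decoded from JSON.
sample_response = {
    "response": {"numFound": 3, "start": 0,
                 "docs": [{"id": "1"}, {"id": "2"}, {"id": "3"}]}
}
doc_ids = parse_doc_ids(sample_response)
```

One such request is issued for every term of interest, which is exactly the one-request-per-term cost described above.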

With this data, it is easy to build the co-occurrence matrix. To mine associations, it is possible to use other software such as Weka, or to write your own implementation of, say, the Apriori algorithm.
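
To make the last step concrete, here is a small sketch that builds pairwise co-occurrence counts from a term-to-document-ids mapping and ranks the terms associated with a given one. It is not a full Apriori implementation: it scores term pairs only, using the confidence of the rule "term implies other". The function names and the toy data are made up for illustration.

```python
from itertools import combinations


def cooccurrence(postings):
    """Count shared documents for every pair of terms.

    postings maps term -> set of document ids.
    Returns {(a, b): shared_count} with a < b, omitting pairs that never co-occur.
    """
    counts = {}
    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        if shared:
            counts[(a, b)] = shared
    return counts


def associated_terms(postings, term, top_n=5):
    """Rank other terms by confidence: shared docs divided by docs containing `term`."""
    base = postings[term]
    scored = []
    for other, docs in postings.items():
        if other == term:
            continue
        shared = len(base & docs)
        if shared:
            scored.append((other, shared / len(base)))
    # Highest confidence first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Toy data: term -> ids of the documents it occurs in.
postings = {
    "solr": {1, 2, 3, 4},
    "lucene": {1, 2, 3},
    "index": {2, 4},
    "weka": {5},
}
pair_counts = cooccurrence(postings)
related = associated_terms(postings, "solr")
```

The first element of `related` answers the "closest term" need, and the whole list answers the "most associated terms" need. For thousands of terms the pairwise loop grows quadratically, so restricting `postings` to the most frequent terms, as suggested above, keeps it manageable.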
