Lucene 4.9: Get TF-IDF for a few selected documents from an Index


Problem description

I've seen this or similar questions many times on Stack Overflow and in other online sources. However, the corresponding part of Lucene's API seems to have changed quite a lot, so to sum it up: I did not find any example that works with the latest Lucene version.

What I have:

  • Lucene Index + IndexReader + IndexSearcher
  • a bunch of documents (and their IDs)

What I want: For all terms that occur in at least one of the selected documents, I want to get the TF-IDF for each document. Or to say it differently: for any term that occurs in any of the selected documents, I want to get its TF-IDF values, e.g., as an array (i.e., one TF-IDF value for each of the selected documents).

Any help is highly appreciated! :-)

Here's what I've come up with so far, but there are 2 problems:

  1. It uses a temporarily created RAMDirectory that contains only the selected documents. Is there any way to work on the original index, or does that not make sense?
  2. It does not compute a document-based TF-IDF but only an index-based one, i.e., over all documents. This means that for each term I get only one TF-IDF value, not one for each document and term.


public void getTfidf(IndexReader reader, Writer out, String field) throws IOException {

    Bits liveDocs = MultiFields.getLiveDocs(reader);
    TermsEnum termEnum = MultiFields.getTerms(reader, field).iterator(null);
    BytesRef term = null;
    TFIDFSimilarity tfidfSim = new DefaultSimilarity();
    int docCount = reader.numDocs();

    while ((term = termEnum.next()) != null) {
        String termText = term.utf8ToString();
        Term termInstance = new Term(field, term);
        // term and doc frequency in all documents
        long indexTf = reader.totalTermFreq(termInstance);
        long indexDf = reader.docFreq(termInstance);
        double tfidf = tfidfSim.tf(indexTf) * tfidfSim.idf(docCount, indexDf);
        // store it, but that's not the problem
    }
}

Solution

totalTermFreq does what it sounds like: it provides the term's frequency across the entire index. The TF in the calculation should be the term frequency within a document, not across the entire index. That is why everything you get here is constant per term: all of your variables are computed over the entire index, and none are dependent on the document. To get the term frequency for a document, you should use DocsEnum.freq(). Perhaps something like:

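while ((term = termEnum.next()) != null) {
    Term termInstance = new Term(field, term);
    // document frequency: number of documents in the index containing the term
    long indexDf = reader.docFreq(termInstance);

    // liveDocs is the Bits obtained from MultiFields.getLiveDocs(reader) above
    DocsEnum docs = termEnum.docs(liveDocs, null);
    while (docs.nextDoc() != DocsEnum.NO_MORE_DOCS) {
        // docs.freq() is the term frequency within the current document;
        // idf takes (docFreq, numDocs), in that order (the question's call has them swapped)
        double tfidf = tfidfSim.tf(docs.freq()) * tfidfSim.idf(indexDf, docCount);
        // store it, keyed by the term and docs.docID()
    }
}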
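
To make the difference between an index-wide value and per-document values concrete, here is a minimal plain-Java sketch. It is not Lucene code and is not meant as the definitive way to do this. It assumes the formulas used by DefaultSimilarity in Lucene 4.x, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), computed here in double precision. The class TfIdfSketch, the postings map, and the helper freqs are hypothetical names introduced only for illustration. Document frequency and document count come from the whole index, while values are reported only for the selected documents, which is roughly what working on the original index with a filter on document IDs would amount to.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java model of per-document TF-IDF; no Lucene classes are used. */
class TfIdfSketch {

    // Assumed to mirror DefaultSimilarity.tf(float freq): sqrt(freq).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed to mirror DefaultSimilarity.idf(long docFreq, long numDocs).
    static double idf(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Hypothetical helper: builds docId -> freq from (docId, freq) pairs.
    static Map<Integer, Integer> freqs(int... docIdFreqPairs) {
        Map<Integer, Integer> m = new TreeMap<>();
        for (int i = 0; i + 1 < docIdFreqPairs.length; i += 2) {
            m.put(docIdFreqPairs[i], docIdFreqPairs[i + 1]);
        }
        return m;
    }

    /**
     * postings: term -> (docId -> frequency of the term in that document).
     * Returns term -> (docId -> tf-idf), restricted to the selected documents.
     */
    static Map<String, Map<Integer, Double>> tfIdf(
            Map<String, Map<Integer, Integer>> postings,
            long numDocs,
            Set<Integer> selectedDocs) {
        Map<String, Map<Integer, Double>> result = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet()) {
            // Document frequency is taken over the whole index, not the selection.
            double idf = idf(e.getValue().size(), numDocs);
            Map<Integer, Double> perDoc = new TreeMap<>();
            for (Map.Entry<Integer, Integer> p : e.getValue().entrySet()) {
                if (selectedDocs.contains(p.getKey())) {
                    // One value per (term, document) pair.
                    perDoc.put(p.getKey(), tf(p.getValue()) * idf);
                }
            }
            // Terms that occur in none of the selected documents are skipped.
            if (!perDoc.isEmpty()) {
                result.put(e.getKey(), perDoc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new TreeMap<>();
        postings.put("lucene", freqs(0, 4, 2, 1)); // doc 0: 4 times, doc 2: once
        postings.put("index", freqs(1, 9));        // only in doc 1
        Set<Integer> selected = new TreeSet<>(Arrays.asList(0, 2));
        // Index of 4 documents; only docs 0 and 2 are selected.
        System.out.println(tfIdf(postings, 4, selected));
    }
}
```

In this toy index, "index" is dropped because it occurs only in a document that was not selected, and "lucene" gets two different values because its frequency differs between documents 0 and 2 while its idf is shared.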
