How to calculate cosine similarity with tf-idf using Lucene and Java

Question

I have a query and a set of documents, and I need to rank the documents by their tf-idf cosine similarity to the query. What support does Lucene offer for this? Which values can I get directly from Lucene (for example, is there a method that returns tf and idf?), and how do I compute the cosine similarity with Lucene (is there a function that returns it directly if I pass in the query and document vectors?)

Thanks in advance.

Recommended answer

Lucene already uses an enhanced variant of cosine similarity for scoring, so if you need the raw cosine similarity itself, it is probably doable. I recommend the official page that discusses Lucene scoring.
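For reference, the raw cosine similarity with tf-idf weights is usually written as follows. This is the common textbook form, not Lucene's own scoring formula; the symbols N (number of documents in the corpus), df_t (number of documents containing term t) and w (the tf-idf weight) are notation assumed here, not taken from the answer.

```latex
\[
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t},
\qquad
\cos(q, d) = \frac{\sum_{t} w_{t,q}\, w_{t,d}}
                  {\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}}
\]
```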

If you want to extract that information yourself, here is an outline of the steps for tf:

  1. index the corpus;
  2. open an IndexReader;
  3. iterate over all doc ids, from 0 up to (but not including) maxDoc();
  4. call getTermFreqVector(doc, fieldName);
  5. iterate over the parallel arrays tfv.getTerms() and tfv.getTermFrequencies().
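The last step can be sketched in plain Java. This is a minimal sketch with no Lucene dependency: the class name TermFrequencies and the method fromParallelArrays are hypothetical, and the two arrays stand in for what tfv.getTerms() and tfv.getTermFrequencies() would return in the Lucene version the answer refers to.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: collects one document's term frequencies into a map. */
final class TermFrequencies {

    /**
     * terms[i] occurs freqs[i] times in the document. With Lucene these arrays
     * would come from tfv.getTerms() and tfv.getTermFrequencies(); here they
     * are passed in directly so the sketch stays self-contained.
     */
    static Map<String, Integer> fromParallelArrays(String[] terms, int[] freqs) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("terms and freqs must have the same length");
        }
        Map<String, Integer> tf = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            // merge() sums the counts if a term appears more than once in the input
            tf.merge(terms[i], freqs[i], Integer::sum);
        }
        return tf;
    }
}
```

Calling TermFrequencies.fromParallelArrays(terms, freqs) once per document id gives one sparse tf vector per document, which is the input a cosine computation needs.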

As for docFreq, call IndexReader.terms() and iterate over the returned term enumeration, calling termEnum.docFreq() for each term.
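Once tf and docFreq have been collected, the raw cosine similarity can be computed without further help from Lucene. The following is a minimal sketch under stated assumptions: the class TfIdfCosine and its methods are hypothetical, idf is taken as ln(numDocs / docFreq) (a common textbook choice, not Lucene's own formula), and the maps are assumed to have been filled from the values read out of the index.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: tf-idf weighting and cosine similarity on sparse vectors. */
final class TfIdfCosine {

    /** idf = ln(numDocs / docFreq); an assumed textbook form, not Lucene's. */
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    /** Multiplies each raw term frequency by the idf of its term. */
    static Map<String, Double> weigh(Map<String, Integer> tf,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df != null && df > 0) { // terms absent from the corpus get no weight
                weights.put(e.getKey(), e.getValue() * idf(numDocs, df));
            }
        }
        return weights;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

To rank documents, weigh the query's term frequencies and each document's term frequencies with the same docFreq map, then sort the documents by cosine(queryWeights, docWeights) in descending order.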
