How to calculate cosine similarity with tf-idf using Lucene and Java
Question
I have a query and a set of documents, and I need to rank the documents by their tf-idf cosine similarity to the query. What support does Lucene offer for computing this? Which parameters can I get directly from Lucene (is there a method that returns tf and idf?), and how do I compute cosine similarity with Lucene (is there a function that returns the cosine similarity directly if I pass in the query and document vectors?)
Thanks in advance.
Recommended answer
Lucene's scoring already uses a tweaked version of cosine similarity, so if you need the raw cosine similarity itself, it's probably doable. I recommend the official page that discusses Lucene scoring.
If you want to extract that info on your own, this would be an outline of the steps for tf:
- index the corpus;
- open an IndexReader;
- iterate over all doc ids, 0 to maxDoc();
- call getTermFreqVector(doc, fieldName) for each doc id;
- iterate over the parallel arrays tfv.getTerms() and tfv.getTermFrequencies().
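The last step above can be sketched in plain Java. This is only an illustration, not Lucene code: the class TermFreqMaps and the method fromParallelArrays are hypothetical names, and the two arrays stand in for what tfv.getTerms() and tfv.getTermFrequencies() would return, so the snippet runs without Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: turn the parallel arrays of a term vector into a term -> tf map. */
class TermFreqMaps {

    // terms[i] occurs freqs[i] times in the document. In the answer's outline
    // these arrays would come from tfv.getTerms() and tfv.getTermFrequencies();
    // here they are passed in directly (an assumption made for illustration).
    static Map<String, Integer> fromParallelArrays(String[] terms, int[] freqs) {
        if (terms.length != freqs.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        Map<String, Integer> tf = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            // merge() sums the counts should a term appear more than once
            tf.merge(terms[i], freqs[i], Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        String[] terms = {"cosine", "lucene", "similarity"};
        int[] freqs = {2, 1, 3};
        System.out.println(fromParallelArrays(terms, freqs));
    }
}
```

Keeping one such map per document makes the later tf-idf weighting a simple lookup by term.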
As for the docFreq, use IndexReader.terms() and iterate over the returned term enumeration, calling termEnum.docFreq() for each term.
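Once tf and docFreq have been collected, the raw cosine similarity can be computed by hand. The following is a minimal sketch under stated assumptions, not Lucene's own scoring: it uses the textbook weight tf * ln(numDocs / docFreq), and the class TfIdfCosine with its methods idf, weigh and cosine are hypothetical names introduced for illustration. The maps are assumed to have been filled from the Lucene calls described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: tf-idf weighting and cosine similarity over sparse term maps. */
class TfIdfCosine {

    // Textbook idf, ln(numDocs / docFreq). Lucene's own formula differs;
    // this form is an assumption chosen for simplicity.
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    // Multiply each term's tf by its idf; terms missing from docFreq are skipped.
    static Map<String, Double> weigh(Map<String, Integer> tf,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df == null || df == 0) {
                continue; // term unknown to the corpus
            }
            weights.put(e.getKey(), e.getValue() * idf(numDocs, df));
        }
        return weights;
    }

    // cos(a, b) = (a . b) / (|a| * |b|), or 0 when either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("lucene", 2);
        docFreq.put("java", 1);

        Map<String, Integer> queryTf = new HashMap<>();
        queryTf.put("lucene", 1);
        queryTf.put("java", 1);

        Map<String, Integer> docTf = new HashMap<>();
        docTf.put("lucene", 3);
        docTf.put("java", 1);

        int numDocs = 4;
        double score = cosine(weigh(queryTf, docFreq, numDocs),
                              weigh(docTf, docFreq, numDocs));
        System.out.println("cosine similarity = " + score);
    }
}
```

To rank, compute this score between the query's weighted map and each document's weighted map, then sort the documents by descending score. Note that a term occurring in every document gets idf 0 under this formula and so contributes nothing.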