Calculating the score based only on the number of term occurrences in a document in Lucene


Question

I have started working on a resume (document) retrieval component based on the Lucene.NET engine. It works well: it fetches the documents and scores them based on the following idea:


The idea behind the VSM (vector space model) is that the more times a query term appears in a document, relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.

Lucene's Practical Scoring Function is as follows:

score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

In this formula:

  • tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score.
  • idf(t) stands for inverse document frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears). This means rarer terms contribute more to the total score.
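As a rough illustration of these two factors, the sketch below reproduces the formulas used by Lucene's classic DefaultSimilarity: tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)). It does not call Lucene itself; the class name TfIdfSketch and the sample numbers are assumptions made for illustration.

```java
// Standalone sketch of the tf and idf factors; it does not use the Lucene library.
// Formulas follow Lucene's classic DefaultSimilarity; class and method names are hypothetical.
final class TfIdfSketch {

    // Term frequency factor: grows with the square root of the raw occurrence count.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: terms found in fewer documents get a larger value.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 10000; // assumed collection size
        System.out.printf("tf(10) = %.3f%n", tf(10));
        System.out.printf("tf(20) = %.3f%n", tf(20));
        // A common term (3000 documents) versus a rarer one (30 documents, an assumed value).
        System.out.printf("idf(3000 of %d) = %.3f%n", numDocs, idf(3000, numDocs));
        System.out.printf("idf(30 of %d)   = %.3f%n", numDocs, idf(30, numDocs));
    }
}
```

Because tf grows with the square root of the count, doubling the occurrences of a term raises its contribution by a factor of about 1.41, not 2.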

This works very well in most situations, but because of the fieldNorm calculation the results are not accurate for our use case.

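fieldNorm, also known as the field length norm, represents the length of that field in the document (so shorter fields are boosted automatically).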

Because of this, we don't get accurate results. For example, say I have 10,000 documents, 3,000 of which contain the keywords java and oracle, and the number of occurrences varies from document to document.


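  • Suppose doc A contains java 10 times and oracle 20 times in 1,000 words, while doc B contains java 2 times and oracle 2 times in 50 words.
  • If I search for java and oracle, Lucene returns doc B with a higher score because of length normalization.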
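To make the effect concrete, here is a minimal sketch of the doc A / doc B example. It assumes the classic DefaultSimilarity formulas (tf = sqrt(freq), lengthNorm = 1 / sqrt(number of terms in the field)). It sets idf to 1 for both terms and leaves out coord and queryNorm, which are the same for both documents; these simplifications do not change the ordering here. It also ignores the lossy one-byte encoding Lucene applies to stored norms. The class and method names are hypothetical, and the code does not call Lucene.

```java
// Standalone sketch of how length normalization reorders doc A and doc B.
// It does not use the Lucene library; names and simplifications are assumptions.
final class LengthNormSketch {

    // tf factor as in Lucene's classic DefaultSimilarity.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Length norm as in DefaultSimilarity: shorter fields get a larger factor.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Simplified score for the query "java oracle": the sum of the two tf values,
    // scaled by the length norm when useNorms is true and by 1.0 otherwise.
    static double score(int javaFreq, int oracleFreq, int numTerms, boolean useNorms) {
        double norm = useNorms ? lengthNorm(numTerms) : 1.0;
        return (tf(javaFreq) + tf(oracleFreq)) * norm;
    }

    public static void main(String[] args) {
        // doc A: java x10, oracle x20 in 1000 words; doc B: java x2, oracle x2 in 50 words.
        System.out.printf("with norms:    A = %.3f, B = %.3f%n",
                score(10, 20, 1000, true), score(2, 2, 50, true));
        System.out.printf("without norms: A = %.3f, B = %.3f%n",
                score(10, 20, 1000, false), score(2, 2, 50, false));
    }
}
```

Under these assumptions doc B scores about 0.40 against about 0.24 for doc A when the length norm is applied, while with the norm fixed at 1 doc A scores about 7.63 against about 2.83 for doc B.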

Due to the nature of our business, documents with more occurrences of the search keywords should come first; we don't really care about the length of the document.

Because of this, a candidate with a long resume containing many keywords is pushed down in the results, and some short resumes come up instead.

To avoid that, I need to disable length normalization. Can someone help me with this?

I have attached the Luke result image for your reference.

In this image, the document that contains java 50 times and oracle 6 times has moved down to 11th position.

But the document that contains java 24 times and oracle 5 times is the top scorer because of the fieldNorm.

I hope I have conveyed this clearly. If not, please ask and I will provide more information.

Recommended answer

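You can use Field.setOmitNorms(true) on the field when indexing. This omits the norms for that field, which disables length normalization (and index-time boosts) for it. Because norms are written at index time, the documents have to be re-indexed for the change to take effect.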
