Solr: What are the benefits of length normalization/omitNorms=false?


Question

We're using Solr to search articles of various lengths. We index both descriptive metadata (title, author, category, keywords, etc.) and the full article text. We do not boost relevance at index time; all boosts are done at query time (we use dismax, coupled with various qf, pf, and bf boosts).

Currently our fulltext field uses the default omitNorms=false. As a result, all else being equal, shorter articles (2-3 column-inch articles) frequently score higher than longer, feature-length (multi-page) articles.

In our case article length is a significant indicator of relevance, and so I am considering setting omitNorms=true on our fulltext field.

Questions:
1. Why is the default Lucene/Solr behavior to boost shorter fields over longer ones? What is the reasoning?
2. Why would I not want to omit norms? I don't need to boost queries on this particular field, nor use any kind of faceting on it.

Solution

Question 1:

Boosting shorter fields over longer ones has to do with a fundamental concept of determining document relevancy called TF-IDF (see http://en.wikipedia.org/wiki/Tf%E2%80%93idf). As a short example, suppose your search returned two documents: the first is 100 words long and the second is 1,000 words long. Each contains your search keyword just once. Since the keyword makes up 1% of the text in the first document, the short document is judged to be more relevant to your search than the long document, where the keyword makes up only 0.1% of the text. In Lucene this weighting is carried by the field's length norm, which is what omitNorms=true discards.
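To put numbers on the example above, here is a minimal Python sketch. It assumes the length norm of Lucene's classic TF-IDF similarity, which is roughly 1/sqrt(number of terms in the field); the function names are made up for illustration, and a real index stores norms in a lossy compact form, so actual scores will differ.

```python
import math


def term_share(term_count: int, doc_length: int) -> float:
    """Fraction of the document's words that are the search keyword."""
    return term_count / doc_length


def classic_length_norm(doc_length: int) -> float:
    """Approximation of the classic TF-IDF length norm: 1 / sqrt(numTerms).

    Assumption: this ignores index-time boosts and the lossy encoding
    Lucene applies when it stores the norm.
    """
    return 1.0 / math.sqrt(doc_length)


short_len, long_len = 100, 1000

# One keyword hit in each document, as in the example above.
print("keyword share, short doc:", term_share(1, short_len))
print("keyword share, long doc: ", term_share(1, long_len))

# The norm multiplies the field's score, so the short document gets more weight.
print("length norm, short doc:", classic_length_norm(short_len))
print("length norm, long doc: ", classic_length_norm(long_len))
```

With omitNorms=true this factor is effectively the same for every document, so a single keyword hit counts equally regardless of article length.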

Question 2:

Based on your requirements, it sounds like you might want to try omitting norms. However, this may skew your search results in ways you don't expect. It could be that you have been benefiting from some of the nice properties of length normalization without realizing it. Another approach might be to store document length as a tag field, for example labeling documents as "short", "medium", or "long", and then boost documents that match "long", or "long" and "medium", as needed. This would also give your end users the ability to filter on document length when they search.
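The tagging idea could look something like the sketch below. Everything specific in it is an assumption for illustration: the field name `length_bucket`, the word-count thresholds, and the boost weights are hypothetical and would need tuning against your own articles. The resulting string is meant to be passed as the dismax `bq` (boost query) parameter.

```python
from typing import Dict, Optional


def length_bucket(word_count: int, short_max: int = 300, medium_max: int = 1500) -> str:
    """Label a document by length at index time (thresholds are hypothetical)."""
    if word_count <= short_max:
        return "short"
    if word_count <= medium_max:
        return "medium"
    return "long"


def boost_query(field: str = "length_bucket",
                boosts: Optional[Dict[str, float]] = None) -> str:
    """Build a dismax bq value that boosts the chosen length buckets."""
    if boosts is None:
        # Hypothetical weights: favor long articles, then medium ones.
        boosts = {"long": 2.0, "medium": 1.5}
    return " ".join(f"{field}:{bucket}^{weight}" for bucket, weight in boosts.items())


# Index time: store the label alongside the article.
doc = {"id": "article-1", "length_bucket": length_bucket(4200)}
print(doc)

# Query time: add the boost to the dismax request parameters.
params = {"defType": "dismax", "q": "city council budget", "bq": boost_query()}
print(params["bq"])
```

Because the label is an ordinary indexed field, the same value can also be used in a filter query to let users restrict results by article length.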

As for the nice properties of length normalization, consider a case where a super long article touches on 10 different topics, only 1 of which matches the user's search, and a long article talks about only 1 topic, the one that was searched for. In this case, you'd probably prefer the long article over the super long article (even if the super long article matched the search keyword more times). It all depends on your data and your use cases.
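A rough illustration of that trade-off, again assuming the classic TF-IDF factors (term frequency weighted as sqrt(freq), length norm as 1/sqrt(length)) and leaving out IDF and every other scoring factor. The word counts and keyword counts are invented for the example.

```python
import math


def simplified_score(term_freq: int, doc_length: int, use_norms: bool) -> float:
    """Toy score: sqrt(tf), optionally multiplied by 1/sqrt(length).

    Assumption: this mirrors only two factors of the classic similarity;
    it is not the full Lucene scoring formula.
    """
    tf = math.sqrt(term_freq)
    norm = 1.0 / math.sqrt(doc_length) if use_norms else 1.0
    return tf * norm


# Hypothetical articles: (keyword occurrences, total words).
focused_long = (8, 2_000)        # one topic, the one being searched for
sprawling_long = (12, 20_000)    # ten topics, only one of them relevant

for use_norms in (True, False):
    focused = simplified_score(*focused_long, use_norms=use_norms)
    sprawling = simplified_score(*sprawling_long, use_norms=use_norms)
    winner = "focused" if focused > sprawling else "sprawling"
    print(f"use_norms={use_norms}: focused={focused:.4f} "
          f"sprawling={sprawling:.4f} -> {winner} article ranks higher")
```

With norms enabled the focused article ranks first; with norms omitted the raw keyword count decides, and the sprawling article wins.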
