Solr: What are the benefits of length normalization/omitNorms=false?

Question

We're using Solr to search articles of various lengths. We index both descriptive metadata (title, author, category, keywords, etc.) and the full article text. We do not boost relevance at index time; all boosts are applied at query time (we use dismax, coupled with various qf, pf, and bf boosts).

Currently our fulltext field uses the default omitNorms=false. As a result, all else being equal, shorter articles (2-3 column inches) frequently score higher than longer, feature-length (multi-page) articles.

In our case, article length is a significant indicator of relevance, so I am considering setting omitNorms=true on our fulltext field.

Questions:
1. Why does Lucene/Solr by default boost shorter fields over longer ones? What is the reasoning?
2. Why would I not want to omit norms? I don't need to boost queries on this particular field, nor do I use any kind of faceting on it.

Recommended answer

Question 1:

Boosting shorter fields over longer ones comes from a fundamental concept in determining document relevancy called TF-IDF (see http://en.wikipedia.org/wiki/Tf%E2%80%93idf). As a short example, suppose your search returned two documents: the first is 100 words long and the second is 1,000 words long. Each contains your search keyword just once. Since the keyword makes up 1% of the text of the first document, that document is judged more relevant to your search than the second, where the keyword makes up only 0.1% of the text.
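The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, and the `1 / sqrt(length)` formula is the length norm as I understand Lucene's classic TF-IDF similarity to define it (the real index stores norms in a lossy, compressed form, so actual values will differ slightly).

```python
import math


def term_fraction(term_count: int, doc_length: int) -> float:
    """Share of the document's words that are the search keyword."""
    return term_count / doc_length


def classic_length_norm(doc_length: int) -> float:
    """Length norm in the style of Lucene's classic similarity (assumed here):
    1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(doc_length)


if __name__ == "__main__":
    # The two documents from the example: one keyword match each.
    for length in (100, 1000):
        print(
            f"{length:>5} words: keyword share = {term_fraction(1, length):.1%}, "
            f"length norm = {classic_length_norm(length):.4f}"
        )
```

With omitNorms=true this per-document factor is not stored, so both documents in the example would receive the same score for a single match on this field.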

Question 2:

Based on your requirements, it sounds like you might want to try omitting norms. However, this may skew your search results in ways you don't expect. You may have been benefiting from some of the nice properties of length normalization without realizing it. Another approach is to store document length explicitly in a tag field, for example labeling documents as "short", "medium", or "long", and then boost documents that match "long", or "long" and "medium", or whatever suits you. This would also give your end users the ability to filter on document length when they search.
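A minimal sketch of the tagging idea, in Python, might look like the following. Everything specific here is an assumption for illustration: the field name `length_label`, the word-count thresholds, and the boost values are hypothetical and would need tuning against your own articles. The `bq` (boost query) parameter is a standard dismax parameter.

```python
from urllib.parse import urlencode


def length_label(word_count: int, medium_at: int = 500, long_at: int = 2000) -> str:
    """Bucket an article by word count. Thresholds are hypothetical."""
    if word_count >= long_at:
        return "long"
    if word_count >= medium_at:
        return "medium"
    return "short"


def dismax_params(user_query: str) -> str:
    """Build a query string that boosts longer articles at query time.

    'length_label' is a hypothetical indexed string field that would be
    populated with length_label(...) when each article is indexed.
    """
    params = [
        ("defType", "dismax"),
        ("q", user_query),
        # Boost query: favor long articles most, medium ones slightly.
        ("bq", "length_label:long^2.0 length_label:medium^1.2"),
    ]
    return urlencode(params)


if __name__ == "__main__":
    print(length_label(300), length_label(800), length_label(5000))
    print(dismax_params("length normalization"))
```

To let users filter rather than boost, the same hypothetical field could be used in a filter query such as `fq=length_label:long`.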

Again, when I mention the nice properties of length normalization, think of a case where one super-long article touches on 10 different topics, only 1 of which matches the user's search, while another long article talks about only 1 topic, the one that was searched for. In this case, you'd probably prefer the long article over the super-long one (even if the super-long article matched the search keyword more times). It all depends on your data and your use cases.
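To make that scenario concrete, here is a rough Python sketch comparing the two articles with and without norms. The match counts and word counts are invented for illustration, and the scoring is a simplified stand-in for classic TF-IDF (square root of term frequency times `1 / sqrt(length)`); the idf factor is left out because it is identical for both documents when they match the same term.

```python
import math


def field_score(term_freq: int, doc_length: int, use_norms: bool) -> float:
    """Simplified per-field score: sqrt(tf), optionally scaled by a length norm."""
    tf = math.sqrt(term_freq)
    norm = 1.0 / math.sqrt(doc_length) if use_norms else 1.0
    return tf * norm


# Hypothetical articles: (keyword matches, total words).
long_article = (8, 2000)         # one topic, the one searched for
super_long_article = (12, 20000) # ten topics, one of which matches

if __name__ == "__main__":
    for use_norms in (True, False):
        long_score = field_score(*long_article, use_norms)
        super_score = field_score(*super_long_article, use_norms)
        winner = "long" if long_score > super_score else "super-long"
        print(
            f"norms {'on ' if use_norms else 'off'}: long={long_score:.4f}, "
            f"super-long={super_score:.4f} -> {winner} article ranks first"
        )
```

Under these assumed numbers the focused long article ranks first only while norms are kept; with norms omitted, the raw match count decides and the super-long article wins.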
