Elasticsearch - higher scoring if higher frequency of term

Problem description

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.

Document A has ONLY 1 term in the "tags" field, and it's "Twitter". Document B has 100 terms in the "tags" field, but 3 of them are "Twitter".

Elasticsearch gives the higher score to Document A even though Document B has a higher frequency of the search term. Document B's score is "diluted" because its field has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?

I know Elasticsearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization, so that Document B gets the higher score?

Solution

As the other answer says, it would be interesting to see whether you get the same result on a single shard. I think you would, because the result depends on the norms for the tags field, which are taken into account when computing the score with the TF/IDF similarity (the default).

In fact, Lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, in other words how common the term is in the index, which is used to weigh it against the other terms in the query (in your case it doesn't make any difference, since you are searching for a single term).

But there's another factor called norms, which rewards shorter fields and also takes into account any index-time boost, which can be set per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
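To make the effect of norms concrete, here is a rough sketch of the arithmetic, assuming Lucene's classic TF/IDF similarity, where the term-frequency factor is sqrt(freq) and the length norm is 1/sqrt(number of terms in the field). The function names are made up for illustration. The idf and query normalization factors are left out because they are identical for both documents in a single-term query, and real explain output will differ somewhat because Lucene stores norms in a lossy compact form.

```python
import math


def tf(freq: int) -> float:
    # Classic TF/IDF term-frequency factor: square root of the raw count.
    return math.sqrt(freq)


def length_norm(num_terms: int) -> float:
    # Classic length norm: shorter fields get a larger multiplier.
    return 1.0 / math.sqrt(num_terms)


def relative_score(freq: int, num_terms: int, use_norms: bool = True) -> float:
    # Hypothetical helper: only the factors that differ between the two docs.
    norm = length_norm(num_terms) if use_norms else 1.0
    return tf(freq) * norm


# Document A: 1 tag, 1 occurrence of "Twitter".
# Document B: 100 tags, 3 occurrences of "Twitter".
a_with_norms = relative_score(freq=1, num_terms=1)
b_with_norms = relative_score(freq=3, num_terms=100)
a_without_norms = relative_score(freq=1, num_terms=1, use_norms=False)
b_without_norms = relative_score(freq=3, num_terms=100, use_norms=False)

print(f"with norms:    A={a_with_norms:.3f}  B={b_with_norms:.3f}")
print(f"without norms: A={a_without_norms:.3f}  B={b_without_norms:.3f}")
```

With norms, Document A comes out ahead (1.000 against roughly 0.173); with norms disabled, only the term frequency differs and Document B wins (roughly 1.732 against 1.000).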

I guess the fact that the first document contains only that tag makes it more important than the other one, which contains that tag multiple times but a lot of other tags as well. If you don't like this behaviour, you can just disable norms in your mapping for the tags field. Norms are enabled by default if the field is "index":"analyzed" (the default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (this usually makes sense for tags, but depends on your data and domain), or add the "omit_norms": true option to the mapping for your tags field.
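As an illustration, the two mapping variants and an explain-enabled search body might look like the sketch below. It only builds the JSON bodies and sends nothing to a cluster. The "string" field type and the helper names are assumptions, and the option names follow the syntax quoted in this answer; later Elasticsearch releases changed how these settings are spelled, so check the mapping reference for your version.

```python
import json


def tags_mapping(analyzed: bool = True) -> dict:
    # Hypothetical helper: field mapping for "tags" with norms switched off.
    if analyzed:
        # Keep analysis, but drop norms explicitly.
        field = {"type": "string", "index": "analyzed", "omit_norms": True}
    else:
        # Index the tag as a single untouched token.
        field = {"type": "string", "index": "not_analyzed"}
    return {"properties": {"tags": field}}


def explain_search(term: str) -> dict:
    # Search body with "explain" enabled, so each hit reports how its
    # score was computed (tf, idf and, if present, the field norm).
    return {"explain": True, "query": {"match": {"tags": term}}}


print(json.dumps(tags_mapping(analyzed=True), indent=2))
print(json.dumps(tags_mapping(analyzed=False), indent=2))
print(json.dumps(explain_search("Twitter"), indent=2))
```

Changing the mapping of an existing field generally means reindexing the documents before the scores change.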
