Lucene fieldNorm discrepancy between Similarity calculation and query-time value


Problem Description

I'm trying to understand how fieldNorm is calculated (at index time) and then used (and apparently re-calculated) at query time.

In all the examples I'm using the StandardAnalyzer with no stop words.

Debugging the DefaultSimilarity's computeNorm method while indexing, I've noticed that for 2 particular documents it returns:


  • 0.5 for document A (which has 4 tokens in its field)
  • 0.70710677 for document B (which has 2 tokens in its field)

It does this by using the formula:

state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

where the boost is always 1.
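A quick sketch of that arithmetic (plain Java, nothing Lucene-specific assumed) reproduces the two index-time values:

float boost = 1.0f;
float normA = boost * (float) (1.0 / Math.sqrt(4)); // document A, 4 tokens -> 0.5
float normB = boost * (float) (1.0 / Math.sqrt(2)); // document B, 2 tokens -> 0.70710677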

Afterwards, when I query for these documents, I see that in the query explain I get:


  • 0.5 = fieldNorm(field=titre, doc=0) for document A
  • 0.625 = fieldNorm(field=titre, doc=1) for document B

This is already strange (to me; I'm sure it's me who's missing something). Why don't I get the same values for the field norm as those calculated at index time? Is this the "query normalization" thing in action? If so, how does it work?

This is however more or less OK, since the two query-time fieldNorms give the same ordering as those calculated at index time (the field with the shorter value has the higher fieldNorm in both cases).

I've then made my own Similarity class where I've implemented the computeNorm method like so:

public float computeNorm(String pField, FieldInvertState state) {
    // the boost is added (not multiplied) to 1/sqrt(length), so the norm stays above 1
    float norm = (float) (state.getBoost() + (1.0d / Math.sqrt(state.getLength())));
    return norm;
}
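For completeness, here is a minimal sketch of how such a custom Similarity would typically be wired in, assuming the Lucene 3.x API (IndexWriterConfig.setSimilarity and IndexSearcher.setSimilarity); MySimilarity is a hypothetical name for the class holding the computeNorm above, and the same instance should be registered on both the writer and the searcher:

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CustomSimilaritySetup {
    public static void main(String[] args) throws IOException {
        MySimilarity sim = new MySimilarity(); // hypothetical class containing the computeNorm above

        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        cfg.setSimilarity(sim);                // consulted when computeNorm runs at index time
        IndexWriter writer = new IndexWriter(dir, cfg);
        // ... add documents here ...
        writer.close();

        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        searcher.setSimilarity(sim);           // consulted when scoring/explaining at query time
        searcher.close();
    }
}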

At index time I now get:


  • 1.5 for document A (which has 4 tokens in its field)
  • 1.7071068 for document B (which has 2 tokens in its field)

However now, when I query for these documents, I can see that they both have the same field norm as reported by the explain function:


  • 1.5 = fieldNorm(field=titre, doc=0) for document A
  • 1.5 = fieldNorm(field=titre, doc=1) for document B

To me this is now really strange: if I use an apparently good Similarity to calculate the fieldNorm at index time, one which gives me proper values proportional to the number of tokens, how come later on, at query time, all of this is lost and the query explain says both documents have the same field norm?

So my questions are:


  • Why doesn't the index-time fieldNorm computed by the Similarity's computeNorm method match the one reported by the query explain?

  • Why, for two different fieldNorm values obtained at index time (via the Similarity's computeNorm), do I get identical fieldNorm values at query time?

== Update

OK, I've found something in Lucene's docs (http://lucene.apache.org/core/3_6_2/api/core/org/apache/lucene/search/Similarity.html#formula_norm) which clarifies some of my question, but not all of it:

However the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.

How much precision loss is there? Is there a minimum gap we should put between different values so that they remain different even after the precision-loss re-calculations?

Recommended Answer

The documentation of encodeNormValue describes the encoding step (which is where the precision is lost), and in particular the final representation of the value:


The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.

The most relevant piece to understand is that the mantissa is only 3 bits, which means precision is around one significant decimal digit.
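To get a concrete feel for the loss, a small sketch can round-trip the norms from the question through the encoder. This assumes the Lucene 3.x helper org.apache.lucene.util.SmallFloat, whose floatToByte315/byte315ToFloat pair implements this 3-bit-mantissa, zero-exponent-at-15 format (and is what the default encodeNormValue/decodeNormValue delegate to in that version):

import org.apache.lucene.util.SmallFloat;

public class NormPrecisionDemo {
    public static void main(String[] args) {
        float[] norms = {0.5f, 0.70710677f, 1.5f, 1.7071068f};
        for (float n : norms) {
            byte stored = SmallFloat.floatToByte315(n);        // the single byte written to the index
            float decoded = SmallFloat.byte315ToFloat(stored); // the value explain later reports as fieldNorm
            System.out.println(n + " -> " + decoded);
        }
    }
}

Since the mantissa has only 3 bits, adjacent representable values differ by roughly 7-12% of their magnitude, so two index-time norms generally need to be at least about one such step apart to still decode to different fieldNorm values.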

An important note on the rationale comes a few sentences after where your quote ended, where the Lucene docs say:


The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
