Lucene fieldNorm discrepancy between Similarity calculation and query-time value

Problem description

I'm trying to understand how fieldNorm is calculated (at index time) and then used (and apparently re-calculated) at query time.

In all the examples I'm using the StandardAnalyzer with no stop words.

Debugging the DefaultSimilarity's computeNorm method while indexing stuff, I've noticed that for 2 particular documents it returns:

  • 0.5 for document A (which has 4 tokens in its field)
  • 0.70710677 for document B (which has 2 tokens in its field)

It does this by using the formula:

state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

where boost is always 1
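
As a minimal standalone check (plain Java, nothing Lucene-specific), plugging the two token counts into that formula reproduces the index-time values above:

public class NormFormulaCheck {
    public static void main(String[] args) {
        // boost is 1 for both documents; the sqrt argument is the field's token count
        System.out.println(1.0f * (float) (1.0 / Math.sqrt(4))); // document A, 4 tokens -> 0.5
        System.out.println(1.0f * (float) (1.0 / Math.sqrt(2))); // document B, 2 tokens -> 0.70710677
    }
}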

Afterwards, when I query for these documents I see that in the query explain I get

  • 0.5 = fieldNorm(field=titre, doc=0) for document A
  • 0.625 = fieldNorm(field=titre, doc=1) for document B

This is already strange (to me, I'm sure it's me who's missing something). Why don't I get the same values for field norm as those calculated at index time? Is this the "query normalization" thing in action? If so, how does it work?

This, however, is more or less OK, since the two query-time fieldNorms give the same order as those calculated at index time (the field with the shorter value has the higher fieldNorm in both cases).

I've then made my own Similarity class where I've implemented the computeNorm method like so:

public float computeNorm(String pField, FieldInvertState state) {
    float norm = (float) (state.getBoost() + (1.0d / Math.sqrt(state.getLength())));
    return norm;
}
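
Wrapped in a self-contained class, a sketch of the same override would look like this (assuming the Lucene 3.x API the question uses; the name MySimilarity is just a placeholder):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

public class MySimilarity extends DefaultSimilarity {
    // Same formula as above: boost + 1/sqrt(number of tokens in the field).
    @Override
    public float computeNorm(String field, FieldInvertState state) {
        return (float) (state.getBoost() + (1.0d / Math.sqrt(state.getLength())));
    }
}

Extending DefaultSimilarity keeps tf, idf, coord and queryNorm unchanged and only swaps the length normalization; in the 3.x API the instance would then be registered on both sides, e.g. with IndexWriterConfig.setSimilarity(...) at index time and IndexSearcher.setSimilarity(...) at search time.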

At index time I now get:

  • 1.5 for document A (which has 4 tokens in its field)
  • 1.7071068 for document B (which has 2 tokens in its field)

However now, when I query for these documents, I can see that they both have the same field norm as reported by the explain function:

  • 1.5 = fieldNorm(field=titre, doc=0) for document A
  • 1.5 = fieldNorm(field=titre, doc=1) for document B

To me this is now really strange: if I use an apparently good Similarity to calculate the fieldNorm at index time, one that gives me proper values proportional to the number of tokens, how come that later on, at query time, all of this is lost and the explain says both documents have the same field norm?

So my questions are:

  • why isn't the index-time fieldNorm reported by the Similarity's computeNorm method the same as the one reported by the query explain?
  • why, for two different fieldNorm values obtained at index time (via the Similarity's computeNorm), do I get the same fieldNorm value at query time?

== UPDATE

Ok, I've found something in Lucene's docs which clarifies some of my question, but not all of it:

However the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.

How much precision loss is there? Is there a minimum gap we should put between different values so that they remain different even after the precision-loss re-calculations?

Recommended answer

The documentation of encodeNormValue describes the encoding step (which is where the precision is lost), and particularly the final representation of the value:

The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.

The most relevant piece to understand is that the mantissa is only 3 bits, which means precision is around one significant decimal digit.
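
You can see the effect directly by round-tripping the norms from the question through that single-byte encoding. A small sketch, assuming the Lucene 3.x Similarity API where the encoding is exposed as encodeNormValue/decodeNormValue:

import org.apache.lucene.search.DefaultSimilarity;

public class NormEncodingDemo {
    public static void main(String[] args) {
        DefaultSimilarity sim = new DefaultSimilarity();
        // index-time norms from the question, plus the 0.89 example from the docs
        float[] norms = {0.5f, 0.70710677f, 1.5f, 1.7071068f, 0.89f};
        for (float norm : norms) {
            byte encoded = sim.encodeNormValue(norm);     // the single byte stored in the index
            float decoded = sim.decodeNormValue(encoded); // the value explain() reports at query time
            System.out.println(norm + " -> " + decoded);
        }
    }
}

With those inputs this reproduces exactly what the question observed: 0.70710677 comes back as 0.625, both 1.5 and 1.7071068 come back as 1.5, and decode(encode(0.89)) = 0.75 as the docs say.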

An important note on the rationale comes a few sentences after where your quote ended, where the Lucene docs say:

The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
