How to get DocValue by document ID in Lucene 7+?

Question

I'm adding a DocValue to a document with

doc.add(new BinaryDocValuesField("foo",new BytesRef("bar")));

To retrieve that value for a specific document with ID docId, I call

DocValues.getBinary(reader,"foo").get(docId).utf8ToString();

The get function in BinaryDocValues is supported up to Lucene 6.6, but for Lucene 7.0 and up it does not seem to be available anymore.

So, how do I get the DocValue by document ID in Lucene 7+ (without having to iterate over BinaryDocValues / DocIdSetIterator, and without having to re-get BinaryDocValues and call advanceExact every time)?

Recommended answer

Theory

Doc values are Lucene's column-stride field value storage. They were originally designed to be fast for random access at query time, mainly for faceting and sorting. The issue LUCENE-7407 switched the access pattern from random access to an iterator. Because an iterator API is a much more restrictive access pattern than an arbitrary random-access API, this change gives Lucene more freedom to use aggressive compression and other optimizations:

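  • Reduced disk space usage when the data is sparse
  • Better compression and faster doc values decoding, even in the non-sparse case
  • Removal of the special-case structure for missing values (getDocsWithField) and of thread-local codec readers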
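
To make the trade-off concrete, here is a toy model of a sparse, forward-only column. It is not Lucene code: the class name SparseDocValuesSketch and its fields are invented for illustration, and real codecs use far more elaborate encodings. The sketch only shows why a forward-only contract lets storage hold entries solely for documents that have a value, and why every lookup can resume from the previous position.

```java
import java.util.Arrays;

/** Toy model, NOT Lucene code: a sparse column that stores entries only for docs with a value. */
final class SparseDocValuesSketch {
    private final int[] docIds;    // sorted IDs of the documents that have a value
    private final String[] values; // values[i] belongs to docIds[i]
    private int next = 0;          // lowest index in docIds that can still match
    private int current = -1;      // doc ID the iterator was last positioned on
    private boolean hasValue = false;

    SparseDocValuesSketch(int[] sortedDocIds, String[] values) {
        this.docIds = sortedDocIds;
        this.values = values;
    }

    /** Moves to target (must be >= the last target) and reports whether it has a value. */
    boolean advanceExact(int target) {
        if (target < current) {
            throw new IllegalArgumentException("iterator cannot move backwards");
        }
        current = target;
        // Search only the part of the column that has not been passed yet.
        int idx = Arrays.binarySearch(docIds, next, docIds.length, target);
        hasValue = idx >= 0;
        next = hasValue ? idx : -idx - 1; // on a miss, idx encodes the insertion point
        return hasValue;
    }

    /** Value of the current document; only valid after advanceExact returned true. */
    String value() {
        if (!hasValue) {
            throw new IllegalStateException("current document has no value");
        }
        return values[next];
    }
}
```

With three values spread over a million documents, this layout keeps three entries instead of a million slots. The price is that a lookup for an earlier document needs a fresh iterator.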

You can read about this change in the following blogs:

  • Doc values as iterators
  • Sparse versus dense document values with Apache Lucene

In practice, this change causes performance degradation in some cases, for example SOLR-9599. For the main use cases (faceting and sorting), the iterator API works well when used properly and even enables some optimizations. However, there are many cases where this API is not a good fit. All of those cases were dismissed as incorrect usage (the same problem the Java world had with sun.misc.Unsafe).

In fact, org.apache.lucene.index.DocValuesIterator#advanceExact is quite fast, and in some implementations its performance and complexity are similar to those of the old random-access lookup.
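
A lookup by document ID built on advanceExact might look like the sketch below. It assumes lucene-core 7.0 or later on the classpath. The class DocValueLookup and the method binaryValue are hypothetical helpers, not part of Lucene. Because advanceExact only accepts targets at or after the iterator's current document, the sketch fetches a fresh BinaryDocValues for each call. If the doc IDs are visited in ascending order within one segment, a single instance can be reused instead.

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;

/** Hypothetical helper (not part of Lucene): per-document lookup via advanceExact. */
final class DocValueLookup {
    private DocValueLookup() {}

    /** Returns the binary doc value of field for the top-level docId, or null if the doc has none. */
    static String binaryValue(IndexReader reader, String field, int docId) throws IOException {
        // Doc values are exposed per segment, so find the leaf that contains docId.
        List<LeafReaderContext> leaves = reader.leaves();
        LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(docId, leaves));

        // A fresh iterator each time, because an iterator cannot move backwards.
        BinaryDocValues values = DocValues.getBinary(leaf.reader(), field);

        // advanceExact expects a segment-local doc ID, hence the docBase offset.
        if (values.advanceExact(docId - leaf.docBase)) {
            return values.binaryValue().utf8ToString();
        }
        return null;
    }
}
```

For the document from the question, DocValueLookup.binaryValue(reader, "foo", docId) would replace the old DocValues.getBinary(reader, "foo").get(docId).utf8ToString() call.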
