Lucene Fields vs. DocValues

Question

I'm using and playing with Lucene to index our data and I've come across some strange behaviors concerning DocValues Fields.

So, could anyone please explain the difference between a regular Document field (like StringField, TextField, IntField, etc.) and DocValues fields (like IntDocValuesField, SortedDocValuesField, etc.; the types seem to have changed in Lucene 5.0)?

First, why can't I access DocValues using document.get(fieldname)? If that isn't possible, how can I access them?

Second, I've seen that some features have changed in Lucene 5.0; for example, sorting can only be done on DocValues... why is that?

Third, DocValues can be updated but regular fields cannot (you have to delete and add the whole document)...

Also, and perhaps most important, when should I use DocValues and when regular fields?

Joseph

Recommended answer

Most of these questions are quickly answered by either referring to the Solr Wiki or to a web search, but to get the gist of DocValues: they're useful for all the other stuff associated with a modern Search service except for the actual searching. From the Solr Community Wiki:

DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.

...

DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.
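
To make the "column-oriented, document-to-value mapping" idea concrete, here is a minimal plain-Java sketch. It is a toy model, not Lucene's implementation: the class name `ColumnSketch`, the `priceColumn` array, and the `sortHitsByColumn` helper are all made up for illustration. The only point is that when a field's values sit in one array indexed by document ID, sorting a list of matching documents costs a direct array lookup per hit.

```java
import java.util.Arrays;
import java.util.Comparator;

public class ColumnSketch {
    // Hypothetical column: priceColumn[docId] holds the "price" value of that document.
    static final long[] priceColumn = {30L, 10L, 20L, 5L};

    // Sort the matching document IDs by their column value, ascending.
    static int[] sortHitsByColumn(int[] hits, long[] column) {
        return Arrays.stream(hits)
                .boxed()
                .sorted(Comparator.comparingLong(docId -> column[docId]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        int[] hits = {0, 2, 3}; // documents that matched some query
        System.out.println(Arrays.toString(sortHitsByColumn(hits, priceColumn)));
    }
}
```

With the sample data, documents 0, 2 and 3 carry the values 30, 20 and 5, so the hits come back ordered as 3, 2, 0.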

This should also answer why Lucene 5 requires DocValues for sorting - it's a lot more efficient than the previous approach.

The reason for this is that the storage layout is turned around compared to the standard inverted index. The inverted index maps each term to the documents that contain it, so to gather data for these operations Lucene previously had to un-invert the field in memory (the fieldCache) before it could get from a document to its value. With DocValues it can look up the value for a given document directly, which is very useful when you already have a list of matching documents that you need to sort, facet, or group.
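
For contrast, a regular indexed field lives in an inverted index, which maps each term to the documents containing it. Getting from a document to its value means turning that structure around first, which is roughly what the fieldCache mentioned above did in memory. The sketch below is again a toy model in plain Java, with a hypothetical `UninvertSketch` class and `uninvert` helper, and it assumes at most one term per document for simplicity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UninvertSketch {
    // Turn a term -> docIds mapping into a docId -> term array.
    // Every posting has to be visited once before any per-document lookup is possible.
    static String[] uninvert(Map<String, List<Integer>> postings, int maxDoc) {
        String[] column = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                column[docId] = entry.getKey();
            }
        }
        return column;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        postings.put("blue", Arrays.asList(1, 2)); // "blue" occurs in documents 1 and 2
        postings.put("red", Arrays.asList(0));     // "red" occurs in document 0
        System.out.println(Arrays.toString(uninvert(postings, 3)));
    }
}
```

A DocValues field amounts to having the resulting array written at index time, so this pass over the postings is not needed at search time.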

If I remember correctly, updating a DocValues field (see IndexWriter.updateNumericDocValue and updateBinaryDocValue) only writes a new version of that field's value column and leaves the inverted index untouched. Changing a regular indexed field, by contrast, would affect loads of dependent index structures, which is why deleting and re-adding the whole document is the only viable strategy there.

Use DocValues for fields that need any of the properties mentioned above, such as sorting / faceting / etc.
