Lucene Fields vs. DocValues

Question

I'm using Lucene to index our data, and I've come across some strange behavior concerning DocValues fields.

So, could anyone please explain the difference between regular document fields (like StringField, TextField, IntField, etc.) and DocValues fields (like IntDocValuesField, SortedDocValuesField, etc.; the types seem to have changed in Lucene 5.0)?

First, why can't I access DocValues using document.get(fieldname)? And if I can't, how can I access them?

Second, I've seen that some features changed in Lucene 5.0; for example, sorting can only be done on DocValues. Why is that?

Third, DocValues can be updated, but regular fields cannot (you have to delete and re-add the whole document)...

Finally, and perhaps most important: when should I use DocValues, and when regular fields?

Joseph

Recommended answer

Most of these questions can be answered quickly by consulting the Solr Wiki or doing a web search, but here is the gist of DocValues: they are useful for everything associated with a modern search service except the actual searching. From the Solr Community Wiki:

DocValues are a way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.

...

DocValue fields are now column-oriented fields with a document-to-value mapping built at index time. This approach promises to relieve some of the memory requirements of the fieldCache and make lookups for faceting, sorting, and grouping much faster.

This should also answer why Lucene 5 requires DocValues for sorting: it is far more efficient than the previous fieldCache-based approach.

The reason is that the storage layout is turned around from the standard inverted index. The inverted index maps each value (term) to the documents that contain it, which is ideal for searching. Sorting, faceting, and grouping need the opposite: given a document, find its value. Previously the application had to un-invert the index into the in-memory fieldCache to answer that; with DocValues, the document-to-value mapping is already built at index time and can be read directly. This is very useful when you already have a list of matching documents and need the value for each one.
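To make the two layouts concrete, here is a minimal plain-Java sketch. It is not Lucene code and does not use the Lucene API; the names `LayoutSketch`, `postings`, `priceColumn`, `search`, and `sortByPrice` are made up for illustration. It models an inverted index as a map from term to document IDs and a DocValues-style column as an array indexed by document ID, then sorts a set of search hits by reading each hit's value from the column.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only (hypothetical names); real Lucene keeps both
// structures in compact on-disk formats rather than Java collections.
final class LayoutSketch {
    // Inverted index: term -> IDs of the documents that contain the term.
    static final Map<String, List<Integer>> postings = new TreeMap<>();

    // DocValues-style column: array position = document ID, entry = that document's price.
    static final long[] priceColumn = {30L, 10L, 20L};

    static {
        postings.put("lucene", List.of(0, 1, 2));
        postings.put("solr", List.of(1));
    }

    // Searching uses the inverted index: one lookup yields every matching document.
    static List<Integer> search(String term) {
        return postings.getOrDefault(term, List.of());
    }

    // Sorting uses the column: one array read per hit supplies the sort key.
    static List<Integer> sortByPrice(List<Integer> hits) {
        List<Integer> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingLong(docId -> priceColumn[docId]));
        return sorted;
    }

    public static void main(String[] args) {
        // Documents 0, 1, 2 have prices 30, 10, 20, so this prints [1, 2, 0].
        System.out.println(sortByPrice(search("lucene")));
    }
}
```

In this model, answering "what is the price of document 2?" from `postings` alone would mean scanning every term, which is roughly the work the fieldCache had to do up front; the column makes it a single read.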

If I remember correctly, updating a DocValues field only means writing a new value into that field's column for the document, and nothing else in the index depends on it. Changing a regular indexed field, by contrast, would affect the postings for every term in the field, so deleting and re-adding the whole document is the only viable strategy.

Use DocValues for fields that need any of the capabilities mentioned above, such as sorting, faceting, or grouping. Use regular indexed fields for the actual searching.
