What are docValues in Solr? When should I use them?

Question

So, I have read multiple sources that try to explain what 'docValues' are in Solr, but I don't seem to understand when I should use them, especially in relation to indexed vs stored fields. Can anyone please throw some light on it?

Recommended answer

What are docValues in Solr?

Doc values can be described as Lucene's column-stride field value storage, or more simply as an uninverted (forward) index.

To illustrate with JSON:

  • row-oriented (stored fields)


{
"doc1": {"A":1, "B":2, "C":3},
"doc2": {"A":2, "B":3, "C":4},
"doc3": {"A":4, "B":3, "C":2}
}

  • column-oriented (docValues)


{
"A": {"doc1":1, "doc2":2, "doc3":4},
"B": {"doc1":2, "doc2":3, "doc3":3},
"C": {"doc1":3, "doc2":4, "doc3":2}
}
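
The relationship between the two layouts can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores data on disk; the helper name `to_columns` is made up for this example.

```python
from collections import defaultdict

# Row-oriented layout, mirroring the stored-fields illustration.
rows = {
    "doc1": {"A": 1, "B": 2, "C": 3},
    "doc2": {"A": 2, "B": 3, "C": 4},
    "doc3": {"A": 4, "B": 3, "C": 2},
}

def to_columns(rows):
    """Transpose {doc: {field: value}} into {field: {doc: value}}."""
    columns = defaultdict(dict)
    for doc_id, fields in rows.items():
        for field, value in fields.items():
            columns[field][doc_id] = value
    return dict(columns)

columns = to_columns(rows)
print(columns["A"])  # {'doc1': 1, 'doc2': 2, 'doc3': 4}
```

Reading every value of field A now touches one entry of `columns` instead of every document in `rows`.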

Purpose of docValues?

Stored fields store all field values for one document together in a row-stride fashion. In retrieval, all field values are returned at once per document, so that loading the relevant information about a document is very fast.

However, if you need to scan a single field (for faceting, sorting, grouping, or highlighting), this is slow: you have to iterate through all the documents and load every document's fields on each iteration, which results in disk seeks.

Take sorting as an example: once all the matching documents are found, Lucene needs the value of one field for each of them. Similarly, the faceting engine must look up each term that appears in each document of the result set and pull the document IDs in order to build the facet list.
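
As a rough sketch of why a per-field column helps here, the Python below sorts and facets a set of hits while reading only one field. The data, `sort_hits`, and `facet_counts` are hypothetical names for illustration and do not correspond to Lucene or Solr APIs.

```python
from collections import Counter

# Hypothetical column for field "B": doc ID -> value (a docValues-style view).
column_b = {"doc1": 2, "doc2": 3, "doc3": 3}

def sort_hits(hits, column):
    """Order matching doc IDs by one field, touching only that field's column."""
    return sorted(hits, key=lambda doc_id: column[doc_id])

def facet_counts(hits, column):
    """Count how many matching docs carry each value of the field."""
    return Counter(column[doc_id] for doc_id in hits)

hits = ["doc3", "doc1", "doc2"]  # doc IDs matched by some query
sorted_hits = sort_hits(hits, column_b)
counts = facet_counts(hits, column_b)

print(sorted_hits)   # ['doc1', 'doc3', 'doc2']
print(dict(counts))  # {3: 2, 2: 1}
```

Neither function ever loads fields A or C, which is the point of the column layout.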

Now this problem can be approached in two ways:

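  • Use the existing fields as they are, without docValues. In this case, when you start sorting or aggregating on a given field, the data is lazily un-inverted from the index and put into the fieldCache at search time, so that you can access values given a document ID. This process is very CPU and I/O intensive.
  • Use docValues. They are fast to access at search time because they are stored column by column, so only the value of that one field needs to be decoded per hit. This approach promises to relieve some of the memory requirements of the fieldCache and to make lookups for faceting, sorting, and grouping much faster.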
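
The difference between the two routes can be sketched as follows, assuming a single-valued field and integer document IDs. The `uninvert` function is a toy stand-in for what the fieldCache does lazily at search time; it is not the actual Lucene implementation.

```python
# Hypothetical inverted index for one field: term -> doc IDs containing it.
inverted = {
    "red": [0, 2],
    "blue": [1],
    "green": [3],
}

def uninvert(inverted, num_docs):
    """Build a forward (doc ID -> term) array by walking every posting list."""
    forward = [None] * num_docs
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            forward[doc_id] = term
    return forward

# Search-time route: the whole field is un-inverted before the first lookup.
field_cache = uninvert(inverted, num_docs=4)

# Index-time route: the same forward array is written once, when documents
# are indexed, so a search only has to read it.
doc_values = ["red", "blue", "red", "green"]

print(field_cache == doc_values)  # True
```

Both routes end up with the same forward structure; what changes is when the work is done and where the result lives.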

Like the inverted index, docValues are serialized to disk, so we can rely on the OS's file system cache to manage memory instead of retaining structures on the JVM heap.

When should I use them?

For all the reasons discussed above. If you are in a low-memory environment, or you don’t need to index a field, DocValues are perfect for faceting/grouping/filtering/sorting/function queries. They also have the potential for increasing the number of fields you can facet/group/filter/sort on without increasing your memory requirements. I've been using docvalues in production Solr for sorting and faceting and have seen a huge improvement in performance of these queries.
