What are docValues in Solr? When should I use them?


Question

I have read several sources that try to explain what docValues are in Solr, but I still don't understand when I should use them, especially in relation to indexed versus stored fields. Can anyone shed some light on this?

Recommended answer

What are docValues in Solr?

DocValues can be described as Lucene's column-stride field value storage, or more simply as an uninverted index (a forward index) that maps each document to its value for a field.

To illustrate with JSON:

  • Row-oriented (stored fields)


{
"doc1": {"A": 1, "B": 2, "C": 3},
"doc2": {"A": 2, "B": 3, "C": 4},
"doc3": {"A": 4, "B": 3, "C": 2}
}

  • Column-oriented (docValues)


{
"A": {"doc1": 1, "doc2": 2, "doc3": 4},
"B": {"doc1": 2, "doc2": 3, "doc3": 3},
"C": {"doc1": 3, "doc2": 4, "doc3": 2}
}
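The relationship between the two layouts can be sketched in Python as a simple transpose. This is only an illustration of the idea, not how Lucene stores data on disk; the function name `to_column_oriented` is hypothetical.

```python
from collections import defaultdict


def to_column_oriented(rows):
    """Transpose {doc_id: {field: value}} into {field: {doc_id: value}}."""
    columns = defaultdict(dict)
    for doc_id, fields in rows.items():
        for field, value in fields.items():
            columns[field][doc_id] = value
    return dict(columns)


# Row-oriented layout, as stored fields keep it: one entry per document.
stored_fields = {
    "doc1": {"A": 1, "B": 2, "C": 3},
    "doc2": {"A": 2, "B": 3, "C": 4},
    "doc3": {"A": 4, "B": 3, "C": 2},
}

# Column-oriented layout, as docValues keep it: one entry per field.
doc_values = to_column_oriented(stored_fields)
print(doc_values["A"])
```

Reading every value of field A now touches a single entry of `doc_values` instead of every document.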

What is the purpose of docValues?

Stored fields keep all field values for one document together, in a row-stride fashion. At retrieval time, all field values of a document are returned at once, so loading the relevant information about a single document is very fast.

However, if you need to scan one field across many documents (for faceting, sorting, grouping, or highlighting), this becomes slow: you have to iterate through all the documents and load every document's fields on each iteration, which results in disk seeks.

Take sorting as an example: once all matching documents are found, Lucene needs the value of one field for each of them. Similarly, the faceting engine must look up each term that appears in each document of the result set and pull the document IDs in order to build the facet list.
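A minimal Python sketch of that access pattern, using the toy data from the JSON illustration. The function names and the "values loaded" counter are hypothetical and stand in for the I/O cost; real Lucene internals are considerably more involved.

```python
from collections import Counter

stored_fields = {
    "doc1": {"A": 1, "B": 2, "C": 3},
    "doc2": {"A": 2, "B": 3, "C": 4},
    "doc3": {"A": 4, "B": 3, "C": 2},
}
doc_values = {
    "A": {"doc1": 1, "doc2": 2, "doc3": 4},
    "B": {"doc1": 2, "doc2": 3, "doc3": 3},
    "C": {"doc1": 3, "doc2": 4, "doc3": 2},
}


def sort_with_stored_fields(matched, field):
    """Sort hits by `field`, loading each document's whole row to get one value."""
    values_loaded = 0
    keyed = []
    for doc_id in matched:
        row = stored_fields[doc_id]   # every field of the document is loaded
        values_loaded += len(row)
        keyed.append((row[field], doc_id))
    return [doc_id for _, doc_id in sorted(keyed)], values_loaded


def sort_with_doc_values(matched, field):
    """Sort hits by `field`, reading only that field's column."""
    column = doc_values[field]
    keyed = [(column[doc_id], doc_id) for doc_id in matched]
    return [doc_id for _, doc_id in sorted(keyed)], len(keyed)


def facet_counts(matched, field):
    """Count how many hits carry each value of `field`."""
    column = doc_values[field]
    return Counter(column[doc_id] for doc_id in matched)


matched = ["doc1", "doc2", "doc3"]
print(sort_with_stored_fields(matched, "C"))
print(sort_with_doc_values(matched, "C"))
print(facet_counts(matched, "B"))
```

Both sort functions return the same order, but the row-oriented version touches every field of every hit while the column-oriented version touches one value per hit.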

This problem can be approached in two ways:

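  • Use the existing fields as they are, without docValues. In this case, when you start sorting or aggregating on a given field, the data is lazily un-inverted and put into the fieldCache at search time, so that values can be looked up by document ID. This process is very CPU- and I/O-intensive.
  • Use docValues. Accessing docValues at search time is very fast because they are stored column-stride, so only the value of that one field needs to be decoded per hit. This approach promises to relieve some of the fieldCache's memory requirements and to make lookups for faceting, sorting, and grouping much faster.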
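The difference between the two approaches can be sketched in Python. This assumes a toy single-valued field and a dictionary-based inverted index; `uninvert`, `cached_values`, and `field_cache` are hypothetical names for illustration, not Lucene APIs.

```python
# Toy inverted index for field "B": term -> documents containing it.
inverted_index = {"B": {2: ["doc1"], 3: ["doc2", "doc3"]}}

# The same information as a forward mapping, as docValues would hold it.
# Built once at index time, so nothing needs to be computed at search time.
doc_values = {"B": {"doc1": 2, "doc2": 3, "doc3": 3}}

field_cache = {}      # stands in for the heap-resident fieldCache
uninvert_calls = 0    # counts how often the expensive step runs


def uninvert(postings):
    """Turn term -> [doc_id, ...] into doc_id -> term (single-valued field)."""
    forward = {}
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            forward[doc_id] = term
    return forward


def cached_values(field):
    """First approach: un-invert lazily on first use, then keep the result in memory."""
    global uninvert_calls
    if field not in field_cache:
        uninvert_calls += 1
        field_cache[field] = uninvert(inverted_index[field])
    return field_cache[field]


first = cached_values("B")    # pays the un-inversion cost at search time
second = cached_values("B")   # served from the in-memory cache
print(first == doc_values["B"], uninvert_calls)
```

Both paths end with the same document-to-value mapping; what differs is when the work is done and where the result is kept.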

Like the inverted index, docValues are serialized to disk, so we can rely on the operating system's file system cache to manage memory instead of retaining structures on the JVM heap.

When should I use them?

For all the reasons discussed above. If you are in a low-memory environment, or you don't need to index a field, docValues are well suited to faceting, grouping, filtering, sorting, and function queries. They can also increase the number of fields you can facet, group, filter, or sort on without increasing your memory requirements. I have been using docValues in production Solr for sorting and faceting and have seen a huge improvement in the performance of these queries.
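As a rough sketch of how such a field might be declared, the Python snippet below builds an `add-field` command for Solr's Schema API with `docValues` enabled and indexing turned off. The field name `price` and the field type `pfloat` are assumptions for illustration; use whatever names and types your own schema defines. The snippet only builds and prints the JSON body and does not contact a Solr server.

```python
import json


def add_field_command(name, field_type, indexed=False, stored=False, doc_values=True):
    """Build a Schema API 'add-field' command for a docValues-backed field."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": indexed,       # not searchable by term
            "stored": stored,         # not kept in row-oriented stored fields
            "docValues": doc_values,  # column-oriented values for sort/facet/group
        }
    }


# Hypothetical field used only for sorting and faceting.
payload = add_field_command("price", "pfloat")
body = json.dumps(payload, indent=2)
print(body)
```

A body like this would typically be sent as a POST to the collection's schema endpoint. Enabling docValues on a field that already holds data generally requires reindexing.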
