What are docValues in Solr? When should I use them?


Question

I have read multiple sources that try to explain what docValues are in Solr, but I still don't understand when I should use them, especially in relation to indexed vs. stored fields. Can anyone shed some light on this?

Recommended answer

What are docValues in Solr?

DocValues can be described as Lucene's column-stride field value storage, or more simply as an uninverted index (a forward index).

To illustrate with JSON:

  • Row-oriented (stored fields)


{
  "doc1": {"A": 1, "B": 2, "C": 3},
  "doc2": {"A": 2, "B": 3, "C": 4},
  "doc3": {"A": 4, "B": 3, "C": 2}
}

  • Column-oriented (docValues)


{
  "A": {"doc1": 1, "doc2": 2, "doc3": 4},
  "B": {"doc1": 2, "doc2": 3, "doc3": 3},
  "C": {"doc1": 3, "doc2": 4, "doc3": 2}
}
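
The relationship between the two layouts can be sketched in a few lines of Python. This only illustrates the idea and says nothing about how Lucene encodes data on disk; `to_columns` is a hypothetical helper name introduced here.

```python
def to_columns(rows):
    """Transpose {doc: {field: value}} into {field: {doc: value}}."""
    columns = {}
    for doc_id, fields in rows.items():
        for field, value in fields.items():
            # Group every value of one field together, keyed by document.
            columns.setdefault(field, {})[doc_id] = value
    return columns


# Row-oriented view: one entry per document (like stored fields).
rows = {
    "doc1": {"A": 1, "B": 2, "C": 3},
    "doc2": {"A": 2, "B": 3, "C": 4},
    "doc3": {"A": 4, "B": 3, "C": 2},
}

# Column-oriented view: one entry per field (like docValues).
columns = to_columns(rows)
print(columns["A"])
```

Reading every value of field `A` now touches a single entry of `columns` instead of every document in `rows`.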

Purpose of docValues?

Stored fields keep all field values for one document together, in a row-stride fashion. At retrieval time, all field values of a document are returned at once, so loading the relevant information about a single document is very fast.

However, if you need to scan a single field across many documents (for faceting, sorting, grouping, or highlighting), this layout is slow: you have to iterate through all the documents and load each document's fields on every iteration, which results in disk seeks.

Take sorting as an example: once all the matching documents are found, Lucene needs to get the value of the sort field for each of them. Similarly, the faceting engine must look up each term that appears in each document of the result set and pull the document IDs in order to build the facet list.
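
As a rough sketch of why per-document field access matters, the Python below sorts and facets a set of hits using one column per field. The data, the field names `category` and `price`, and both helper functions are made up for illustration. Real docValues are encoded on disk rather than held in Python lists.

```python
from collections import Counter

# One column per field; position i holds the value for internal doc ID i.
# (Hypothetical data.)
category = ["book", "music", "book", "film", "music", "book"]
price = [12.0, 8.5, 30.0, 15.0, 9.99, 7.0]


def sort_hits(hits, column):
    """Order matching doc IDs by their value in a single column."""
    return sorted(hits, key=lambda doc_id: column[doc_id])


def facet_counts(hits, column):
    """Count how many matching documents carry each value."""
    return Counter(column[doc_id] for doc_id in hits)


hits = [0, 2, 3, 5]  # doc IDs matched by some query
sorted_hits = sort_hits(hits, price)
facets = facet_counts(hits, category)
print(sorted_hits, dict(facets))
```

Each hit costs one lookup in one column. No other field of the document is loaded.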

This problem can be approached in two ways:

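  • Use the existing indexed field. In this case, when you start sorting or aggregating on a given field, the data is lazily uninverted and put into the fieldCache at search time so that you can access the value for a given document ID. This process is very CPU- and I/O-intensive.
  • Use docValues. They are very fast to access at search time because they are stored column-stride, so only the value of that one field needs to be decoded per hit. This approach promises to relieve some of the memory requirements of the fieldCache and to make lookups for faceting, sorting, and grouping much faster.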
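
The first approach can be pictured with the sketch below, which uninverts a toy inverted index into a forward array at "search time". This is a simplified model, not Lucene's actual fieldCache code; the `uninvert` function and the sample postings are assumptions made for illustration, and the field is assumed to be single-valued.

```python
def uninvert(inverted_index, num_docs):
    """Build a forward array (doc ID -> term) from term -> doc ID postings.

    Assumes each document has at most one term for this field.
    """
    forward = [None] * num_docs
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            forward[doc_id] = term
    return forward


# Inverted index for one field: term -> list of doc IDs containing it.
inverted_index = {"blue": [1, 3], "green": [2], "red": [0, 4]}

# Every posting has to be visited before the first lookup is possible.
forward = uninvert(inverted_index, num_docs=5)
print(forward[3])
```

With docValues, the equivalent of `forward` is written at index time, so this pass over all postings is not needed at search time.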

Like the inverted index, docValues are serialized to disk, so we can rely on the OS's file system cache to manage memory instead of retaining structures on the JVM heap.

When should I use them?

For all the reasons discussed above. If you are in a low-memory environment, or you don't need to index a field, docValues are well suited to faceting, grouping, filtering, sorting, and function queries. They also have the potential to increase the number of fields you can facet, group, filter, or sort on without increasing your memory requirements. I have been using docValues in production Solr for sorting and faceting and have seen a huge improvement in the performance of these queries.
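
In Solr, docValues are enabled per field by setting `docValues="true"` on the field definition in the schema. The sketch below builds the kind of request that benefits from it: a query that sorts on one field and facets on another. The host, the core name `products`, and the field names `price` and `category` are hypothetical, and both fields are assumed to be declared with docValues enabled. `q`, `sort`, `facet`, `facet.field`, and `rows` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host and core name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"


def build_query_url(base_url, sort_field, facet_field):
    """Build a select URL that sorts on one field and facets on another."""
    params = {
        "q": "*:*",                   # match all documents
        "sort": f"{sort_field} asc",  # sorting reads per-document values
        "facet": "true",
        "facet.field": facet_field,   # faceting also reads per-document values
        "rows": 10,
    }
    return f"{base_url}?{urlencode(params)}"


url = build_query_url(SOLR_SELECT_URL, "price", "category")
print(url)
```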
