How are Solr filters actually implemented?


Question


Is my understanding of query processing correct?

  1. The DocSet for the first filter query is fetched from the cache; if it is not cached, the filter creates an OpenBitSet or SortedVIntSet implementation and caches it.
  2. The DocSet for every other filter is likewise fetched from the cache or created as its own DocBitSet implementation, and it is intersected with the first one (the efficiency of this step depends on the implementation of the first DocSet).
  3. We leapfrog between the MainQuery and the final DocSet (after all intersections) using a Lucene filter+query search (the efficiency of this also depends on the first DocSet implementation).
  4. We apply post filters (cost > 100 && cache==false) as an AND on the original query.
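The leapfrog in step 3 can be illustrated with a minimal sketch over two sorted arrays of doc IDs. This is not Solr's or Lucene's actual code: the class `LeapfrogSketch` and its methods are hypothetical names, and plain `int[]` arrays stand in for the query's and the filter's doc ID iterators.

```java
import java.util.Arrays;

// Hypothetical illustration, not Solr/Lucene code: plain sorted int[] arrays
// stand in for the doc ID iterators of the main query and the filter DocSet.
final class LeapfrogSketch {

    /** Intersects two ascending, duplicate-free doc ID arrays by leapfrogging. */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out[n++] = a[i];          // both sides agree: emit the doc
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i + 1, b[j]); // a is behind: jump it up to b's doc
            } else {
                j = advance(b, j + 1, a[i]); // b is behind: jump it up to a's doc
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Returns the first index >= from whose value is >= target, or arr.length. */
    static int advance(int[] arr, int from, int target) {
        int pos = Arrays.binarySearch(arr, from, arr.length, target);
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        int[] queryDocs = {1, 4, 7, 9, 12, 30};
        int[] filterDocs = {4, 5, 9, 30, 41};
        System.out.println(Arrays.toString(intersect(queryDocs, filterDocs))); // prints [4, 9, 30]
    }
}
```

Each side only ever moves forward, so the cost is driven by how cheaply each side can skip ahead to a target doc ID.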

As a consequence, performance will depend on the first filter, since SortedIntSet is more efficient for a small query and BitSet is better for a big one. Am I correct?

Second part of the question: DocSet has two main implementations, HashDocSet and SortedIntDoc. Each intersection implementation iterates over all entries in the first filter and checks whether each one is also in the second DocSet. That means we have to sort filters by size, smallest first. Is it possible to control the order of cached filters (cost only works for non-cached filters)?

Solution

It sounds good. For more information, have a look at SolrIndexSearcher#getProcessedFilter.

"As a consequence, performance will depend on the first filter, since SortedIntSet is more efficient for a small query and BitSet is better for a big one. Am I correct?"

This is more a question of space efficiency than of speed. A sorted int[] costs 4 * nDocs bytes, while a bit set costs maxDoc / 8 bytes. The two break even at nDocs = maxDoc / 32, which is why Solr uses a sorted int[] whenever the number of documents in the set is < maxDoc / 32.
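The arithmetic behind that threshold can be checked with a small sketch. The class and method names here are hypothetical and only encode the two cost formulas and the maxDoc / 32 rule quoted above; they are not taken from Solr's source.

```java
// Hypothetical helper that only encodes the formulas from the answer above.
final class DocSetSizingSketch {

    /** Bytes needed by a sorted int[] holding nDocs doc IDs (4 bytes per int). */
    static long sortedIntBytes(long nDocs) {
        return 4L * nDocs;
    }

    /** Bytes needed by a bit set with one bit per document in the index. */
    static long bitSetBytes(long maxDoc) {
        return maxDoc / 8;
    }

    /** True when the sorted int[] is the smaller representation per the maxDoc / 32 rule. */
    static boolean preferSortedInts(long nDocs, long maxDoc) {
        return nDocs < maxDoc / 32;
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000;
        for (long nDocs : new long[] {10_000, 31_250, 100_000}) {
            System.out.printf("nDocs=%d sorted=%d bytes, bitset=%d bytes, preferSorted=%b%n",
                    nDocs, sortedIntBytes(nDocs), bitSetBytes(maxDoc),
                    preferSortedInts(nDocs, maxDoc));
        }
    }
}
```

With an index of 1,000,000 documents the bit set always costs 125,000 bytes, so a filter matching 10,000 documents is cheaper as a sorted int[] (40,000 bytes), while one matching 100,000 documents is not (400,000 bytes).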

"Second part of the question: DocSet has two main implementations, HashDocSet and SortedIntDoc."

The problem with SortedIntDocSet is that it doesn't support random access, and the problem with HashDocSet is that it can't enumerate doc IDs in order, which can be important for scoring. This is why Solr uses SortedIntDocSets almost everywhere and creates a transient HashDocSet whenever it needs random access (look at JoinQParserPlugin or DocSlice#intersect for example).
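The trade-off described here (ordered iteration on one side, random access on the other) can be sketched as follows. This is an illustration under assumptions, not the code in JoinQParserPlugin or DocSlice#intersect: `TransientHashSketch` is a hypothetical name, a `java.util.HashSet` stands in for a transient HashDocSet, and a sorted `int[]` stands in for a SortedIntDocSet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration: HashSet<Integer> plays the role of a transient
// hash-based doc set, and a sorted int[] plays the role of a sorted doc set.
final class TransientHashSketch {

    /**
     * Intersects an ascending doc ID array with an arbitrary-order doc ID array.
     * The hash set gives random access; iterating the sorted side keeps the
     * result in doc ID order.
     */
    static int[] intersect(int[] sortedDocs, int[] otherDocs) {
        Set<Integer> lookup = new HashSet<>();   // built only for this call
        for (int doc : otherDocs) {
            lookup.add(doc);
        }
        int[] out = new int[sortedDocs.length];
        int n = 0;
        for (int doc : sortedDocs) {             // in-order enumeration
            if (lookup.contains(doc)) {          // random-access membership test
                out[n++] = doc;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] sortedDocs = {2, 5, 8, 13, 21};
        int[] otherDocs = {21, 3, 5, 34, 2};
        System.out.println(Arrays.toString(intersect(sortedDocs, otherDocs))); // prints [2, 5, 21]
    }
}
```

The hash structure is discarded after the call, so only the ordered representation needs to be kept around or cached.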
