Elasticsearch: Order of filters for best performance

Problem description

The Elasticsearch guide says:

"Each filter is calculated and cached independently, regardless of where it is used. If two different queries use the same filter, the same filter bitset will be reused. Likewise, if a single query uses the same filter in multiple places, only one bitset is calculated and then reused." (https://www.elastic.co/guide/en/elasticsearch/guide/current/filter-caching.html)

On another page it also says:

"The order of filters in a bool clause is important for performance. More-specific filters should be placed before less-specific filters in order to exclude as many documents as possible, as early as possible. If Clause A could match 10 million documents, and Clause B could match only 100 documents, then Clause B should be placed before Clause A." (https://www.elastic.co/guide/en/elasticsearch/guide/current/_filter_order.html)

I do not quite understand how the order of filters in a bool clause is important when each filter is cached independently.

I would imagine that Clause B is executed or retrieved from the cache, Clause A is executed or retrieved from the cache and then the filter bitsets are 'merged'. Why would the order matter?

Solution

This guidance is a little misleading. The reality is more complicated, and it is very hard to write one set of rules that fits all situations. As data changes, the rules change. As query and filter types change, the rules change. When a specific filter is slower to execute than a broad one, the rules change. On a per-segment basis, the result size of a filter might be the opposite of what it is on another segment, so it isn't always predictable. So first you have to understand more of the internals, and then you need to let go of trying to control it as you move into modern Elasticsearch 2.x.

NOTE: your second quote (filter order) and its associated link point to a page that is considered "out of date" for Elasticsearch 2.x; it will be updated later. Therefore the advice may or may not apply to modern versions.

Looking back in time to Elasticsearch 1.x and the reason for the ordering suggestion:

Let's talk first about how filters are represented in memory. They are either an iterated list of matching documents, or a random access "is it here" model. Which one is more efficient depends on the type of filter. Now if everything is cached, you are just intersecting them, and the cost will vary by size and type.
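
To make the two representations concrete, here is a minimal Python sketch. The names `SortedListFilter`, `BitsetFilter` and `intersect` are hypothetical and exist only for this illustration; they are not Elasticsearch or Lucene APIs, and the real data structures are considerably more elaborate.

```python
class SortedListFilter:
    """Iterator model: the filter result is a sorted list of matching doc IDs."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)

    def __iter__(self):
        return iter(self.doc_ids)


class BitsetFilter:
    """Random-access model: one bit per document answers "is it here?"."""

    def __init__(self, doc_ids, max_doc):
        self.bits = bytearray((max_doc + 7) // 8)
        for doc_id in doc_ids:
            self.bits[doc_id >> 3] |= 1 << (doc_id & 7)

    def matches(self, doc_id):
        return bool(self.bits[doc_id >> 3] & (1 << (doc_id & 7)))


def intersect(small, big):
    """Walk the small list and probe the bitset once per candidate."""
    return [doc_id for doc_id in small if big.matches(doc_id)]


# Hypothetical data: a narrow filter and a broad one over 1000 documents.
small = SortedListFilter([3, 70, 900])
big = BitsetFilter(range(0, 1000, 2), max_doc=1000)
print(intersect(small, big))  # [70, 900]
```

In this sketch the work done by `intersect` is proportional to the size of the list being iterated, which is one way to see why the size and type of each side influence the cost.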

If filters are not cached but are cacheable, then each filter will execute independently, and the previous filters will only affect it through the total cost of intersection.

If the filter is NOT cacheable, then it COULD be guided by the previous results. Imagine a Query plus a Filter. If you execute the query and apply the filter afterwards, you are doing a lot of extra work if the filter limits the results to a very small set of records. You wasted time in the query collecting, scoring, and building a big set of results. But if you convert to a FilteredQuery and do both at the same time, then the Query ignores all records already eliminated by the Filter. It only has to consider the documents still in play. This is called "skipping". Not all filter types take advantage of skipping, but some can. And this is why a smaller "guiding" filter will make the others using it faster.
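
The idea of skipping can be sketched as a "leapfrog" intersection, where a small leading iterator tells the other one which document to jump to. This is a simplified illustration under my own assumptions: `DocIdIterator`, `advance` and `leapfrog` are hypothetical names, and the binary search stands in for whatever skip structure a real engine would use.

```python
from bisect import bisect_left

NO_MORE_DOCS = float("inf")


class DocIdIterator:
    """Iterator over sorted doc IDs that can skip ahead to a target."""

    def __init__(self, doc_ids):
        self.doc_ids = doc_ids  # must already be sorted
        self.pos = 0
        self.calls = 0  # how many times this iterator was consulted

    def advance(self, target):
        """Move to the first doc ID >= target and return it."""
        self.calls += 1
        self.pos = bisect_left(self.doc_ids, target, self.pos)
        if self.pos < len(self.doc_ids):
            return self.doc_ids[self.pos]
        return NO_MORE_DOCS


def leapfrog(lead, follower):
    """Intersect two iterators, letting the lead guide the follower."""
    matches = []
    doc = lead.advance(0)
    while doc != NO_MORE_DOCS:
        other = follower.advance(doc)
        if other == doc:
            matches.append(doc)
            doc = lead.advance(doc + 1)
        else:
            # The follower is ahead, so the lead skips forward to catch up.
            doc = lead.advance(other)
    return matches


# Hypothetical data: 3 guiding documents against 2000 candidates.
lead = DocIdIterator([5, 500, 5000])
follower = DocIdIterator(list(range(0, 10000, 5)))
print(leapfrog(lead, follower))  # [5, 500, 5000]
print(follower.calls)            # 3
```

Here the broad side is consulted only three times even though it holds 2000 entries, because the small side dictates where to look.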

Unless you know each filter type, the heuristics of your data, and how each specific filter will be affected by each of these, you just do not have enough information to say anything other than "put most limiting filters first, and larger ones second" and hope it works out. For bool the default is not to cache its overall result, so you have to pay attention to its repeated performance (and/or cache it). Intersection is more efficient when one side of the filter intersection is small. So having a small one to start with makes all the other intersections faster, because they can only get smaller. If it were a bool query instead of a filter, and therefore doing scoring, it is even more important to avoid scoring more documents than necessary.

One other important note is that "most specific filter first" sometimes can be slow (script filter, or other), so it should really read: "lowest cost, most specific filters first".
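
As a rough illustration of "lowest cost, most specific filters first", the sketch below sorts filters by an assumed per-document cost and then by an assumed match count. `FilterEstimate`, its fields and the numbers are all hypothetical; in practice you would have to measure or estimate them for your own data, and this simple lexicographic ordering is only one possible heuristic.

```python
from dataclasses import dataclass


@dataclass
class FilterEstimate:
    name: str
    per_doc_cost: float      # assumed relative cost of checking one document
    estimated_matches: int   # assumed number of matching documents


def order_filters(filters):
    """Cheapest filters first; among equally cheap ones, most selective first."""
    return sorted(filters, key=lambda f: (f.per_doc_cost, f.estimated_matches))


# Hypothetical estimates: the script filter is very selective but expensive.
filters = [
    FilterEstimate("script", per_doc_cost=100.0, estimated_matches=50),
    FilterEstimate("type:book", per_doc_cost=1.0, estimated_matches=10_000_000),
    FilterEstimate("tag:elasticsearch", per_doc_cost=1.0, estimated_matches=10),
]
print([f.name for f in order_filters(filters)])
# ['tag:elasticsearch', 'type:book', 'script']
```

With this ordering the expensive script filter would only be evaluated on documents that survived the cheap filters.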

With Elasticsearch 2.0, things will change:

"It’s time to forget everything you knew about queries and filters: Elasticsearch 2.0 will make much better decisions by itself instead of relying on users to formulate an optimized query."

In 2.x you should try less to game the system, and let the engine make the best choices. The engine may actually end up with something quite different under the hood: a rewritten filter, or a complete change in internal structure and data. You may not even control the caching anymore, so you need to read more about that.

"The previous filter API could be consumed in two ways: either using iterators over matching documents, or using an optional random-access API that allowed to check if a particular document matched the filter or not. Everything is good so far, except that the best way to consume a filter depended on which kind of filter you had: for instance the script filter was more efficient when using the random-access API while the bool filter was more efficient using the iterator API. This was quite a nightmare to optimize and was the root cause why the bool filter on the one hand and the 'and' and 'or' filters on the other hand performed differently."

The engine will now decide what is best, taking more factors into consideration, including scoring, estimated result size, the best way to intersect related filters (maybe even on a per-segment basis), and more.

This article also makes it clear that even caching can be misleading: it doesn't always make things faster. Sometimes the internal data structure is better as originally used than the bitset structure that is always cached. So 2.x also changes this, to avoid caching things that execute better from the native data structure without any caching at all.

The blog post Roaring Bitmaps has more details:

"Clearly the most important requirement is to have something fast: if your cached filter is slower than executing the filter again, it is not only consuming memory but also making your queries slower. The more sophisticated an encoding is, the more likely it is to slow down encoding and decoding because of the increased CPU usage."

Here you get a lot of information about the internal data structures, caching, and intersection, plus more on the internal changes in 2.x, which will help you deepen your understanding of filter performance.

"While it may surprise you if you are new to search engine internals, one of the most important building blocks of a search engine is the ability to efficiently compress and quickly decode sorted lists of integers."
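
To give a feel for what compressing a sorted list of integers can look like, here is a small sketch using delta encoding followed by variable-byte encoding. This is a generic textbook technique chosen for illustration; it is not claimed to be the encoding Elasticsearch uses, and the function names `encode` and `decode` are my own.

```python
def encode(doc_ids):
    """Store gaps between sorted doc IDs, 7 bits per byte, low bits first."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        delta = doc_id - previous
        previous = doc_id
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # high bit set: more bytes follow
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode(data):
    """Rebuild the sorted doc IDs by summing the decoded gaps."""
    doc_ids = []
    value = shift = previous = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += value
            doc_ids.append(previous)
            value = shift = 0
    return doc_ids


doc_ids = [3, 130, 131, 100000]
packed = encode(doc_ids)
print(len(packed))              # 6
print(decode(packed) == doc_ids)  # True
```

Small gaps fit in a single byte, which is why sorted, dense lists compress well; the trade-off the quote mentions is that every read has to pay for the decoding step.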

These last few 2.x blog links give you a lot of background on your question; they cover all of the issues you are trying to work around with filter ordering. The information and details are all there, and you can get a better understanding of 1.x vs. 2.x and how queries+filters are resolved. So remember:

"There is no particular implementation which is constantly better than all others."

See also these 1.x resources for historical reference:

  • Optimizing Elasticsearch searches covers a bit more about filter ordering. It says in summary:

    That said, you still need to think about which order you filter in. You want the more selective filters to run first. Say you filter on type: book and tag: elasticsearch. If you have 30 million documents, 10 million of type book and only 10 tagged Elasticsearch, you’ll want to apply the tag filter first. It reduces the number of documents much more than the book filter does.

  • All About Elasticsearch Filter Bitsets is considered obsolete for modern versions, but it gives more background about the filter ordering document you quoted.

  • A forum answer by Martijn v Groningen seems to say the opposite about which of the bool and and filters uses iteration and which uses random access, but the idea is the same for each: be safe by limiting documents earlier in the filter list, regardless of which model applies to which filter type.
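
For the 1.x case, the "tag before type" advice from the first resource can be sketched as a request body built in Python. This assumes the Elasticsearch 1.x `filtered` query with a `bool` filter; the field names `tag` and `type` come from the quoted example, while the helper `build_filtered_query` is hypothetical. In 2.x and later the DSL differs and the engine chooses the execution order itself.

```python
import json


def build_filtered_query(filters):
    """Wrap a list of filters in a 1.x-style filtered query with a bool filter."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"bool": {"must": filters}},
            }
        }
    }


# The selective tag filter (about 10 matches in the quoted example) goes first,
# the broad type filter (about 10 million matches) second.
tag_filter = {"term": {"tag": "elasticsearch"}}
type_filter = {"term": {"type": "book"}}
body = build_filtered_query([tag_filter, type_filter])
print(json.dumps(body, indent=2))
```

The resulting JSON could be sent as the body of a search request against a 1.x cluster; whether the order actually changes performance depends on the factors discussed in the answer.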
