Elasticsearch Scroll Behaviour


Question

I came across the scroll functionality in Elasticsearch, and it looks pretty interesting. I have gone through many documents, but the following questions are still not clear to me.

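  1. If offset is already available, why use scroll?
  2. What about upcoming records? Suppose it has finished scrolling through all the data and a few seconds later new data arrives in the index. How will that work? Will it also scroll to pick up the new records, like streaming?
  3. Suppose the connection breaks because of server load or a network issue. Will it start scrolling the data from the beginning?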

All these questions are in the context of re-indexing data from an old index to a new index.

Recommended answer

I will try to give some info on this, as I have recently done some research into it myself:

If offset is already there, why use scroll?

I am not sure whether you can use scroll in combination with offsets. But I believe the main difference is that an offset query can give you "false" results: false in the sense that each query executes correctly, but every page reflects all updates made in between. For reindexing this would be wrong, because you risk losing data. Imagine you run an offset query for 10k results and then take 2 minutes to process them. Your documents might be updated (or new ones inserted) within those 2 minutes. Offsetting the next query by 10k might then skip a few rows (imagine a deletion in between) or return rows you have already seen (imagine an insert in between). Scroll, however, guarantees to keep the search context alive and returns results in a clear and strict way, in which no later updates are considered.
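To make the difference concrete, here is a toy simulation in plain Python. It does not talk to Elasticsearch at all: the set of integers stands in for document ids, `offset_page` mimics re-running an offset query against the live index, and `scroll_pages` mimics a frozen search context. All names are made up for illustration.

```python
def offset_page(live_ids, offset, size):
    """Offset paging: every page re-runs the query against the live index."""
    return sorted(live_ids)[offset:offset + size]


def scroll_pages(live_ids, size):
    """Scroll-like paging: freeze the result set once, then slice the frozen copy."""
    frozen = sorted(live_ids)  # stands in for the search context
    for start in range(0, len(frozen), size):
        yield frozen[start:start + size]


# Offset paging with a delete between page 1 and page 2.
index = {1, 2, 3, 4, 5, 6}
seen_offset = offset_page(index, 0, 3)    # first page
index.discard(2)                          # concurrent delete shifts everything up
seen_offset += offset_page(index, 3, 3)   # second page starts one document too late

# Scroll-like paging with the same delete in between.
index = {1, 2, 3, 4, 5, 6}
pages = scroll_pages(index, 3)
seen_scroll = next(pages)                 # snapshot is taken here
index.discard(2)                          # later change is invisible to the snapshot
seen_scroll += next(pages)

print("offset paging saw:", seen_offset)
print("scroll paging saw:", seen_scroll)
```

With offset paging, document 4 is never returned, because the delete moved it onto a page that was already read. The frozen copy is not affected.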

I think the required behaviour could also be implemented with a constant sort plus search_after, as documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html This should make the results stable (in the sense that the cursor pointing to the offset stays correct); however, I think it would still reflect all changes that happen between two requests.
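A rough sketch of what such a search_after loop could look like. The `sort` and `search_after` request keys and the per-hit `sort` values in the response are part of the search API; everything else is an assumption: `send` is a hypothetical transport function you would supply yourself, and `created_at` / `doc_id` are invented field names (the second one acting as a unique tiebreaker). The fake transport at the bottom only exists so the sketch runs without a cluster.

```python
def search_after_pages(send, index, size=2):
    """Yield pages of hits using a fixed sort plus search_after.

    `send(method, path, body)` is a hypothetical transport that returns
    the parsed JSON response as a dict.
    """
    body = {
        "size": size,
        # Assumed field names; the sort must be stable and end in a unique field.
        "sort": [{"created_at": "asc"}, {"doc_id": "asc"}],
    }
    while True:
        hits = send("POST", f"/{index}/_search", body)["hits"]["hits"]
        if not hits:
            return
        yield hits
        # Continue right after the sort values of the last hit on this page.
        body = {**body, "search_after": hits[-1]["sort"]}


def fake_send(method, path, body):
    """Stand-in for a cluster: five documents, already in sort order."""
    docs = [{"_id": str(i), "sort": [100 + i, i]} for i in range(5)]
    after = body.get("search_after")
    remaining = [d for d in docs if after is None or d["sort"] > after]
    return {"hits": {"hits": remaining[:body["size"]]}}


pages = list(search_after_pages(fake_send, "old_index"))
ids = [hit["_id"] for page in pages for hit in page]
print(ids)
```

Each request here is an ordinary search, so no search context is held on the server, which is also why changes between two requests can still show up.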

I would imagine re-indexing works by changing your config (say, Logstash) to start inserting new documents into the new index, and then scrolling over ALL the old data to reindex it into the new index. By using scroll, you can keep working with the old data while changes do not affect your reindex operation.
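For the "write it into the new index" half, one option is to turn each scrolled batch into a bulk request body. A minimal sketch, assuming a typeless index (no mapping types, as in recent versions) and hits that carry `_id` and `_source`; `hits_to_bulk` and `new_index` are names I made up:

```python
import json


def hits_to_bulk(hits, dest_index):
    """Build a newline-delimited bulk body that indexes each hit into dest_index."""
    lines = []
    for hit in hits:
        # Action line: keep the original document id so re-runs overwrite, not duplicate.
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        # Source line: the document exactly as it was stored in the old index.
        lines.append(json.dumps(hit["_source"]))
    # The bulk format requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


batch = [
    {"_id": "1", "_source": {"user": "alice"}},
    {"_id": "2", "_source": {"user": "bob"}},
]
bulk_body = hits_to_bulk(batch, "new_index")
print(bulk_body, end="")
```

The resulting string would be sent to the `_bulk` endpoint. Depending on your version, the built-in `_reindex` API may do all of this for you, so treat this only as an illustration of the manual route.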

Docs:

While a search request returns a single "page" of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.

Next:

What about upcoming records? Suppose it finished scrolling all the data and then, a few seconds later, new data came into the index. How will that work? Will it scroll to get the new records as well, like streaming?

Scrolling preserves the result set created by the first scroll request. This is done by taking a snapshot, so later changes are not visible to that specific scroll. Docs:

The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

Third:

Suppose the connection is broken because of server load or a network issue. Will it start scrolling the data from the beginning?

This does not matter. Scroll comes with a keep-alive setting, e.g. POST /twitter/tweet/_search?scroll=1m, where the value 1m tells Elasticsearch how long the search context is kept alive within the ES server. This means that if your connection breaks, all you need to do is pick up your scroll id and use it to create a new request, as long as the keep-alive has not expired. ES will match that id to the existing search context and give you the expected results. Docs:

In order to use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the "search context" alive (see Keeping the search context alive), eg ?scroll=1m.
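Putting that together, a scroll loop that survives a dropped connection might look roughly like this. The endpoints (`_search?scroll=...`, `/_search/scroll`, and the DELETE that clears the context) and the `_scroll_id` response field are from the scroll API; `send`, `scroll_all`, `FakeCluster` and the retry policy are my own assumptions, and the path is written without a mapping type. The retry also assumes the failed request never reached the server, otherwise that batch could already have been consumed.

```python
def scroll_all(send, index, keep_alive="1m", size=2, max_retries=3):
    """Yield every hit of a match_all query through the scroll API.

    `send(method, path, body)` is a hypothetical transport returning parsed
    JSON; it may raise ConnectionError when the connection drops.
    """
    def call(method, path, body):
        for attempt in range(max_retries):
            try:
                return send(method, path, body)
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise  # give up after the last attempt

    page = call("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": size, "query": {"match_all": {}}})
    scroll_id = page["_scroll_id"]
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            # Same scroll id, so the server continues from the existing context.
            page = call("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = page["_scroll_id"]
    finally:
        # Free the search context instead of waiting for the keep-alive to expire.
        send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})


class FakeCluster:
    """In-memory stand-in that drops the first scroll request once."""

    def __init__(self, n_docs):
        self.docs = [{"_id": str(i)} for i in range(n_docs)]
        self.pos = 0
        self.size = 0
        self.failures_left = 1
        self.cleared = False

    def send(self, method, path, body):
        if method == "DELETE":
            self.cleared = True
            return {"succeeded": True}
        if path == "/_search/scroll" and self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("connection dropped")
        self.size = body.get("size", self.size)
        batch = self.docs[self.pos:self.pos + self.size]
        self.pos += len(batch)
        return {"_scroll_id": "ctx-1", "hits": {"hits": batch}}


cluster = FakeCluster(5)
scrolled_ids = [hit["_id"] for hit in scroll_all(cluster.send, "old_index")]
print(scrolled_ids)
```

The point is only that the client keeps the scroll id around: after the failure it asks again with the same id rather than starting the search over.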

Generally, all that information can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

Hope this helps,

Artur
