Elasticsearch scrolling using the Python client


Question

When scrolling in Elasticsearch, it is important to provide the latest scroll_id with each scroll request:


The initial search request and each subsequent scroll request returns a new scroll_id; only the most recent scroll_id should be used.

The following example (taken from here) puzzles me. First, the scrolling initialization:

rs = es.search(index=['tweets-2014-04-12', 'tweets-2014-04-13'],
               scroll='10s',
               search_type='scan',
               size=100,
               preference='_primary_first',
               body={
                   "fields": ["created_at", "entities.urls.expanded_url", "user.id_str"],
                   "query": {
                       "wildcard": {"entities.urls.expanded_url": "*.ru"}
                   }
               })
sid = rs['_scroll_id']

Then the loop:

tweets = []
while (1):
    try:
        rs = es.scroll(scroll_id=sid, scroll='10s')
        tweets += rs['hits']['hits']
    except:
        break

It works, but I don't see where sid is updated. I assume it happens internally, in the Python client, but I don't understand how.

Recommended answer

In fact, the code has a bug. To use the scroll feature correctly, you are supposed to pass the new scroll_id returned by each call into the next call to scroll(), not reuse the first one:


Important

The initial search request and each subsequent scroll request returns a new scroll_id; only the most recent scroll_id should be used.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html

It works because Elasticsearch does not always change the scroll_id between calls; for smaller result sets it can keep returning the same scroll_id that was originally returned for some time. In this discussion from last year, two other users see the same behavior, with the same scroll_id being returned for a while:

http://elasticsearch-users.115913.n3.nabble.com/Distributing-query-results-using-scrolling-td4036726.html

So while your code works for a smaller result set, it is not correct. You need to capture the scroll_id returned by each new call to scroll() and use it for the next call.
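As a minimal sketch of that fix, the loop could be wrapped in a small helper that reassigns sid after every call. The function name collect_scroll_hits is hypothetical and not part of the Python client. It assumes es is a client object whose scroll() accepts scroll_id and scroll keyword arguments, as in the question, and that each response carries '_scroll_id' and 'hits'. Unlike the original loop, which relies on an exception to break out, this version stops when a page comes back empty.

```python
def collect_scroll_hits(es, first_scroll_id, scroll='10s'):
    """Drain a scroll, always passing the most recent scroll_id.

    Hypothetical helper: `es` is assumed to expose scroll(scroll_id=..., scroll=...)
    and to return dicts shaped like the responses in the question.
    """
    hits = []
    sid = first_scroll_id
    while True:
        rs = es.scroll(scroll_id=sid, scroll=scroll)
        # Capture the id from this response before doing anything else,
        # so the next request never reuses a stale one.
        sid = rs['_scroll_id']
        page = rs['hits']['hits']
        if not page:
            # An empty page means the scroll is exhausted.
            break
        hits.extend(page)
    return hits
```

With the initialization from the question, this might be called as tweets = collect_scroll_hits(es, rs['_scroll_id']).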
