Efficiently getting all documents in an elasticsearch index


Problem Description

I want to get all results from a match-all query in an elasticsearch cluster. I don't care if the results are up to date and I don't care about the order, I just want to steadily keep going through all results and then start again at the beginning. Is scroll and scan best for this, it seems like a bit of a hit taking a snapshot that I don't need. I'll be looking at processing 10s millions of documents.

Recommended Answer

Somewhat of a duplicate of elasticsearch query to return all records. But we can add a bit more detail to address the overhead concern. (Viz., "it seems like a bit of a hit taking a snapshot that I don't need.")

A scroll-scan search is definitely what you want in this case. The "snapshot" is not a lot of overhead here. The documentation describes it metaphorically as "like a snapshot in time" (emphasis added). The actual implementation details are a bit more subtle, and quite clever.
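In outline, the scroll loop works as in the sketch below. The two fetch functions are in-memory stand-ins for the real `_search?scroll` and `_search/scroll` HTTP calls (in practice you would use a client such as elasticsearch-py, e.g. its `helpers.scan` utility), so the control flow runs self-contained:

```python
# Sketch of the scroll loop. The fetch functions are in-memory
# stand-ins for the real scroll API calls; swap in a real client
# against your cluster for production use.

DOCS = [{"_id": i} for i in range(10)]  # stand-in for the index contents
PAGE_SIZE = 3

def open_scroll():
    # Stand-in for POST /index/_search?scroll=1m: returns a cursor
    # (here just an offset) plus the first page of hits.
    return PAGE_SIZE, DOCS[:PAGE_SIZE]

def next_page(cursor):
    # Stand-in for POST /_search/scroll: returns the next cursor and
    # page; an empty page means the scroll is exhausted.
    return cursor + PAGE_SIZE, DOCS[cursor:cursor + PAGE_SIZE]

def scan_all():
    # One full pass over every document, page by page.
    cursor, hits = open_scroll()
    while hits:
        yield from hits
        cursor, hits = next_page(cursor)

# "Keep going through all results and then start again": each call to
# scan_all() opens a fresh scroll from the beginning.
first_pass = [doc["_id"] for doc in scan_all()]
second_pass = [doc["_id"] for doc in scan_all()]
assert first_pass == list(range(10)) and first_pass == second_pass
```

Each full pass opens (and eventually expires) its own scroll context; nothing about one pass depends on the previous one.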

A slightly more detailed explanation comes later in the documentation:


Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.

So the reason the context is cheap to preserve is because of how Lucene index segments behave. A Lucene index is partitioned into multiple segments, each of which is like a stand-alone mini index. As documents are added (and updated), Lucene simply appends a new segment to the index. Segments are write-once: after they are created, they are never again updated.

Over time, as segments accumulate, Lucene will periodically do some housekeeping in the background. It scans through the segments and merges segments to flush the deleted and outdated information, eventually consolidating into a smaller set of fresher and more up-to-date segments. As newer merged segments replace older segments, Lucene will then go and remove any segments that are no longer actively used by the index at large.
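A toy model of the idea, with hypothetical names (real Lucene segments are immutable on-disk files, not Python lists), shows why holding a scroll's snapshot costs little more than a few references:

```python
# Toy model of write-once segments and snapshot-by-reference.
# Names and structures are illustrative, not Lucene's actual API.

index_segments = [[{"v": 1}], [{"v": 2}]]   # the "live" list of segments

# Opening a scroll just copies *references* to the current segments;
# segment contents are never mutated, so this is cheap.
scroll_view = list(index_segments)

# Background merge: build one bigger segment and swap it in. The live
# index now points at the merged segment; the scroll still points at
# the old ones, which therefore cannot be reclaimed yet.
merged = [doc for seg in index_segments for doc in seg] + [{"v": 3}]
index_segments[:] = [merged]

live_docs = [d["v"] for seg in index_segments for d in seg]
snapshot_docs = [d["v"] for seg in scroll_view for d in seg]
assert live_docs == [1, 2, 3]      # the live index sees the merge
assert snapshot_docs == [1, 2]     # the scroll's view is unchanged
```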

This segmented index design is one reason why Lucene is much more performant and resilient than a simple B-tree. Continuously appending segments is cheaper in the long run than the accumulated IO of updating files directly on disk. Plus the write-once design has other useful properties.

The snapshot-like behavior used here by Elasticsearch is to maintain a reference to all of the segments active at the time the scrolling search begins. So the overhead is minimal: some references to a handful of files. Plus, perhaps, the size of those files on disk, as the index is updated over time.

This may be a costly amount of overhead, if disk space is a serious concern on the server. It's conceivable that an index being updated rapidly enough while a scrolling search context is active may as much as double the disk size required for an index. Toward that end, it's helpful to ensure that you have enough capacity such that an index may grow to 2–3 times its expected size.
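As a back-of-the-envelope check (the 120 GB expected index size is a made-up figure; the factor of 3 is the upper end of the 2–3x headroom suggested above):

```python
expected_index_gb = 120   # hypothetical expected index size
safety_factor = 3         # headroom for segments pinned by open scrolls
required_gb = expected_index_gb * safety_factor
assert required_gb == 360
```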
