Understanding Segments in Elasticsearch


Question

I was under the assumption that each shard in Elasticsearch is an index. But I read somewhere that each segment is a Lucene index.

What exactly is a segment? How does it affect search performance? With default Elasticsearch settings, my indices reach around 450GB in size every day (I create a new one every day).

When I execute curl -XPOST "http://localhost:9200/logstash-2013.03.0$i_optimize?max_num_segments=1", I get num_committed_segments=11 and num_search_segments=11.

Shouldn't the above values be 1? Maybe it's because of the index.merge.policy.segments_per_tier value? What is this tier anyway?

Recommended answer

The word "index" gets abused a bit in Elasticsearch: it is applied to too many things.

To explain:

An "index" in Elasticsearch is a bit like a database in a relational DB. It's where you store/index your data. But that's just what your application sees. Internally, an index is a logical namespace that points to one or more shards.

Also, "to index" means to "put" your data into Elasticsearch. Your data is both stored (for retrieval) and "indexed" (for search).

An "inverted index" is the data structure that Lucene uses to make data searchable. It processes the data, pulls out the unique terms or tokens, then records which documents contain each token. See http://en.wikipedia.org/wiki/Inverted_index for more.
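
As a rough illustration of that data structure, here is a minimal Python sketch. It is not how Lucene implements it: the function name build_inverted_index and the lowercase whitespace tokenizer are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted IDs of the documents that contain it.

    `docs` is a dict of doc_id -> text. Tokenization here is a plain
    lowercase whitespace split, a stand-in for a real analyzer.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


docs = {1: "quick brown fox", 2: "lazy brown dog"}
index = build_inverted_index(docs)
print(index["brown"])  # documents 1 and 2 both contain "brown"
print(index["fox"])    # only document 1 contains "fox"
```

Looking up a term is then a single dictionary access, which is what makes the structure suitable for search.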

A "shard" is an instance of Lucene. It is a fully functional search engine in its own right. An "index" could consist of a single shard, but generally consists of several shards, to allow the index to grow and to be split over several machines.

A "primary shard" is the main home for a document. A "replica shard" is a copy of the primary shard that provides (1) failover in case the primary dies and (2) increased read throughput.

Each shard contains multiple "segments", where a segment is an inverted index. A search in a shard will search each segment in turn, then combine their results into the final results for that shard.

While you are indexing documents, Elasticsearch collects them in memory (and in the transaction log, for safety). Then, every second or so, it writes a new small segment to disk and "refreshes" the search.

This makes the data in the new segment visible to search (i.e. the documents are "searchable"), but the segment has not been fsync'ed to disk, so it is still at risk of data loss.

Every so often, Elasticsearch will "flush", which means fsync'ing the segments (they are now "committed") and clearing out the transaction log, which is no longer needed because we know that the new data has been written to disk.
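
The difference between refresh and flush can be modelled with a small in-memory sketch. ToyShard and all of its attributes are hypothetical names for this example, and the "committed" flag merely stands in for an fsync; nothing here touches a disk or reflects Elasticsearch internals.

```python
class ToyShard:
    """Toy model of the indexing buffer, transaction log, and segments."""

    def __init__(self):
        self.buffer = []    # indexed documents that are not yet searchable
        self.translog = []  # record of operations not yet committed
        self.segments = []  # each entry: {"docs": [...], "committed": bool}

    def index(self, doc):
        # New documents go to the in-memory buffer and the transaction log.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Write the buffer out as a new small, uncommitted segment.
        if self.buffer:
            self.segments.append({"docs": list(self.buffer), "committed": False})
            self.buffer.clear()

    def flush(self):
        # Commit all segments (stand-in for fsync), then drop the translog.
        self.refresh()
        for segment in self.segments:
            segment["committed"] = True
        self.translog.clear()

    def searchable_docs(self):
        return [doc for segment in self.segments for doc in segment["docs"]]


shard = ToyShard()
shard.index("doc-a")
print(shard.searchable_docs())  # nothing is searchable before a refresh
shard.refresh()
print(shard.searchable_docs())  # the document is searchable but uncommitted
shard.flush()
print(shard.translog)           # the transaction log is empty after a flush
```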

The more segments there are, the longer each search takes. So Elasticsearch will merge a number of segments of a similar size (a "tier") into a single bigger segment, through a background merge process. Once the new bigger segment is written, the old segments are dropped. This process is repeated on the bigger segments when there are too many of the same size.
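
To make the idea of a tier concrete, here is a heavily simplified sketch. It assumes a tier is just a size bucket that grows by a factor of segments_per_tier, and that a tier is merged as soon as it holds that many segments. Lucene's actual merge policy is considerably more involved, and tier_of and merge_segments are hypothetical names.

```python
from collections import defaultdict


def tier_of(size, factor):
    """Return the size bucket ("tier") that a segment of `size` falls into."""
    tier = 0
    while size >= factor:
        size //= factor
        tier += 1
    return tier


def merge_segments(sizes, segments_per_tier=3):
    """Merge crowded tiers until no tier holds `segments_per_tier` segments."""
    if segments_per_tier < 2:
        raise ValueError("segments_per_tier must be at least 2")
    sizes = list(sizes)
    while True:
        tiers = defaultdict(list)
        for size in sizes:
            tiers[tier_of(size, segments_per_tier)].append(size)
        crowded = [t for t, members in tiers.items()
                   if len(members) >= segments_per_tier]
        if not crowded:
            return sorted(sizes)
        # Merge the smallest crowded tier into one bigger segment and
        # drop the old segments it was built from.
        members = tiers[min(crowded)][:segments_per_tier]
        for size in members:
            sizes.remove(size)
        sizes.append(sum(members))


print(merge_segments([1] * 9, segments_per_tier=3))
print(merge_segments([1, 1, 5], segments_per_tier=3))
```

In the first call the nine small segments are merged in threes, and the three resulting segments then crowd the next tier and are merged again. In the second call no tier is crowded, so nothing is merged.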

Segments are immutable. When a document is updated, Elasticsearch actually just marks the old document as deleted and indexes a new document. The merge process also expunges these old deleted documents.
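
A toy model of update-as-delete-plus-reindex, under loud assumptions: segments are immutable tuples of (seq, doc_id, source) entries, deletions are tracked in a separate set of seq numbers, and new_segment, update, and merge are hypothetical helpers rather than real APIs.

```python
from itertools import count

_seq = count()  # internal entry numbers, unique across all segments


def new_segment(docs):
    """Freeze `docs` (doc_id -> source) into an immutable segment."""
    return tuple((next(_seq), doc_id, source) for doc_id, source in docs.items())


def update(segments, deleted, doc_id, source):
    """Mark old copies of `doc_id` as deleted and index the new version."""
    for segment in segments:
        for seq, existing_id, _ in segment:
            if existing_id == doc_id:
                deleted.add(seq)  # the old segment itself is never modified
    segments.append(new_segment({doc_id: source}))


def merge(segments, deleted):
    """Write one new segment containing only the live entries."""
    live = tuple(entry for segment in segments for entry in segment
                 if entry[0] not in deleted)
    deleted.clear()  # the deleted entries are physically gone now
    return [live]


segments = [new_segment({"a": "v1", "b": "v1"})]
deleted = set()
update(segments, deleted, "a", "v2")
print(len(segments), len(deleted))  # two segments, one entry marked deleted
segments = merge(segments, deleted)
print([(doc_id, source) for _, doc_id, source in segments[0]])
```

Until the merge runs, the old version of "a" still occupies space in the first segment and is only filtered out by the deletion marker.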
