Shards and replicas in Elasticsearch

Question

I am trying to understand what a shard and a replica are in Elasticsearch, but I haven't managed to. If I download Elasticsearch and run the script, then as far as I know I have started a cluster with a single node. Now this node (my PC) has 5 shards (?) and some replicas (?).

What are they? Do I have 5 duplicates of the index? If so, why? I could use some explanation.

Recommended answer

I'll try to explain with a real example, since the answer and replies you got don't seem to have helped you.

When you download Elasticsearch and start it up, you create an Elasticsearch node, which tries to join an existing cluster if one is available and otherwise creates a new one. Let's say you created your own new cluster with a single node, the one you just started. We have no data yet, so we need to create an index.

When you create an index (an index is also created automatically when you index the first document), you can define how many shards it will be composed of. If you don't specify a number, it gets the default number of shards: 5 primaries (this was the default up to Elasticsearch 6.x; since 7.0 the default is 1). What does that mean?

It means that Elasticsearch will create 5 primary shards that will contain your data:

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|
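As a minimal sketch of how those numbers are chosen, the shard counts can be passed as index settings when the index is created. The snippet below only builds the JSON request body and does not contact a cluster. `number_of_shards` and `number_of_replicas` are the actual setting names; the helper `index_settings` and the index name `my-index` are made up for illustration.

```python
import json


def index_settings(primaries: int, replicas: int) -> dict:
    """Build the request body for creating an index with explicit shard counts."""
    if primaries < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "number_of_shards": primaries,    # fixed once the index exists
            "number_of_replicas": replicas,   # can be changed later
        }
    }


# The layout used in this answer: 5 primaries, 1 replica of each.
body = index_settings(primaries=5, replicas=1)

# This would be sent as the body of: PUT /my-index  (hypothetical index name)
print(json.dumps(body, indent=2))
```

On a single node with no foreseeable growth, the same helper with `primaries=1, replicas=0` would describe the smallest possible layout.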

Every time you index a document, Elasticsearch decides which primary shard is supposed to hold that document and indexes it there. Primary shards are not a copy of the data, they are the data! Having multiple shards does help take advantage of parallel processing on a single machine, but the whole point is that if we start another Elasticsearch instance in the same cluster, the shards will be distributed evenly across the cluster.

Node 1 will then hold, for example, only three shards:

 ____    ____    ____ 
| 1  |  | 2  |  | 3  |
|____|  |____|  |____|

The remaining two shards have been moved to the newly started node:

 ____    ____
| 4  |  | 5  |
|____|  |____|

Why does this happen? Because Elasticsearch is a distributed search engine, and this way you can make use of multiple nodes/machines to manage large amounts of data.
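The decision about which primary shard holds a given document, mentioned earlier, can be sketched roughly as "hash a routing value (the document ID by default), then take it modulo the number of primary shards". The snippet below is an illustration of that idea only: `pick_shard` is a hypothetical helper, and `zlib.crc32` is a stand-in, not the hash function Elasticsearch really uses.

```python
import zlib


def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Map a routing value (e.g. a document ID) to a primary shard number.

    Assumption: crc32 is used here purely as a stable stand-in hash;
    Elasticsearch uses its own hash function internally.
    """
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


doc_ids = ["user-1", "user-2", "user-3", "user-4"]

# The same ID always lands on the same shard as long as the shard count is fixed.
placement = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
print(placement)
```

This also hints at why the number of primary shards is fixed when the index is created: with a different shard count, the same ID could map to a different shard, and existing documents would no longer be found where they are expected.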

Every Elasticsearch index is composed of at least one primary shard, since that's where the data is stored. Every shard comes at a cost, though, so if you have a single node and no foreseeable growth, just stick with a single primary shard.

The other type of shard is a replica. The default number of replicas is 1, meaning that every primary shard will be copied to another shard that contains the same data. Replicas are used to increase search performance and for fail-over. A replica shard is never allocated on the same node as its primary (that would be pretty much like putting a backup on the same disk as the original data).

Back to our example: with 1 replica we'll have the whole index on each node, since 2 replica shards will be allocated on the first node, and they will contain exactly the same data as the primary shards on the second node:

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4R |  | 5R |
|____|  |____|  |____|  |____|  |____|

The same goes for the second node, which will contain a copy of the primary shards on the first node:

 ____    ____    ____    ____    ____
| 1R |  | 2R |  | 3R |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|

With a setup like this, if a node goes down, you still have the whole index. The replica shards on the surviving node will automatically become primaries, and the cluster will keep working properly despite the node failure, as follows:

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|

Since you have "number_of_replicas":1, the replicas can no longer be assigned, because a replica is never allocated on the same node as its primary. That's why you'll have 5 unassigned shards (the replicas), and the cluster status will be YELLOW instead of GREEN. No data is lost, but things could be better, since some shards cannot be assigned.
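The failure scenario above can be replayed with a small toy model. This is only a sketch of the bookkeeping, not Elasticsearch's real allocator: `health` and `fail_node` are hypothetical helpers, the layout is the two-node example from this answer, and the model assumes at most one replica per primary.

```python
def health(cluster, num_primaries, num_replicas):
    """Return (status, unassigned_shard_count) for a toy cluster layout."""
    primaries = {n for shards in cluster.values() for n, role in shards if role == "P"}
    replicas = sum(1 for shards in cluster.values() for _, role in shards if role == "R")
    missing_primaries = num_primaries - len(primaries)
    missing_replicas = num_primaries * num_replicas - replicas
    if missing_primaries > 0:
        return "RED", missing_primaries + missing_replicas
    if missing_replicas > 0:
        return "YELLOW", missing_replicas
    return "GREEN", 0


def fail_node(cluster, failed):
    """Drop a node and promote the surviving replicas of the primaries it held."""
    lost = {n for n, role in cluster[failed] if role == "P"}
    survivors = {}
    for node, shards in cluster.items():
        if node == failed:
            continue
        survivors[node] = [
            (n, "P" if role == "R" and n in lost else role) for n, role in shards
        ]
    return survivors


# "P" = primary, "R" = replica; same layout as the diagrams above.
cluster = {
    "node1": [(1, "P"), (2, "P"), (3, "P"), (4, "R"), (5, "R")],
    "node2": [(1, "R"), (2, "R"), (3, "R"), (4, "P"), (5, "P")],
}

print(health(cluster, 5, 1))   # ('GREEN', 0)

after = fail_node(cluster, "node2")
print(after["node1"])          # all five shards are now primaries
print(health(after, 5, 1))     # ('YELLOW', 5)
```

The 5 unassigned shards in the last line are the replicas: with only one node left, there is nowhere to put them that is not the node already holding their primary.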

As soon as the node that left is back up, it will rejoin the cluster and the replicas will be assigned again. The existing shards on the second node can be loaded, but they need to be synchronized with the other shards, since write operations most likely happened while the node was down. At the end of this operation, the cluster status will become GREEN.

Hope this clarifies things for you.
