Shards and replicas in Elasticsearch


Problem description



I am trying to understand what shard and replica is in Elasticsearch, but I don't manage to understand it. If I download Elasticsearch and run the script, then from what I know I have started a cluster with a single node. Now this node (my PC) have 5 shards (?) and some replicas (?).

What are they, do I have 5 duplicates of the index? If so why? I could need some explanation.

Solution

I'll try to explain with a real example, since the answers and replies you got don't seem to have helped.

When you download Elasticsearch and start it up, you create an Elasticsearch node, which tries to join an existing cluster if one is available, or creates a new one. Let's say you created your own new cluster with a single node, the one you just started. We have no data, so we need to create an index.

When you create an index (an index is also created automatically when you index the first document), you can define how many shards it will be composed of. If you don't specify a number, it will have the default number of shards: 5 primaries (note that since Elasticsearch 7.0 the default is a single primary shard). What does that mean?
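The shard count is fixed when the index is created, as part of the index settings. Here is a minimal sketch of the settings body; the commented-out client call and the index name `my_index` are illustrative assumptions, not part of the original answer:

```python
# Settings body for index creation. number_of_shards cannot be
# changed after the index exists; number_of_replicas can.
index_settings = {
    "settings": {
        "number_of_shards": 5,    # primary shards (the pre-7.0 default)
        "number_of_replicas": 1,  # one extra copy of each primary
    }
}

# With the official Python client this would be sent roughly as:
#   es.indices.create(index="my_index", body=index_settings)

settings = index_settings["settings"]
total_shards = settings["number_of_shards"] * (1 + settings["number_of_replicas"])
print(total_shards)  # 5 primaries + 5 replicas = 10 shards
```

So with the defaults above, the cluster has to place 10 shard copies in total, which is what the diagrams below walk through.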

It means that Elasticsearch will create 5 primary shards that will contain your data:

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|

Every time you index a document, Elasticsearch decides which primary shard should hold that document and indexes it there. Primary shards are not copies of the data; they are the data! Having multiple shards does help take advantage of parallel processing on a single machine, but the whole point is that if we start another Elasticsearch instance on the same cluster, the shards will be distributed evenly across the cluster.
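The routing decision described above can be sketched as a hash of the routing value modulo the primary-shard count. Elasticsearch actually uses a murmur3 hash of the `_routing` value (the document `_id` by default); the md5 hash below is only a deterministic stand-in for illustration:

```python
import hashlib

NUM_PRIMARY_SHARDS = 5  # fixed when the index is created

def route(doc_id: str) -> int:
    """Pick the primary shard for a document.
    Elasticsearch really uses murmur3 on the _routing value
    (the _id by default); md5 here is just a stand-in."""
    h = int.from_bytes(hashlib.md5(doc_id.encode()).digest()[:4], "big")
    return h % NUM_PRIMARY_SHARDS

# The same id always lands on the same shard, which is why
# number_of_shards cannot be changed after index creation:
print(route("doc-1") == route("doc-1"))  # True
print(all(0 <= route(f"doc-{i}") < 5 for i in range(100)))  # True
```

This also explains why resizing the primary-shard count requires reindexing: changing the modulus would send existing ids to different shards.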

Node 1 will then hold, for example, only three shards:

 ____    ____    ____ 
| 1  |  | 2  |  | 3  |
|____|  |____|  |____|

Since the remaining two shards have been moved to the newly started node:

 ____    ____
| 4  |  | 5  |
|____|  |____|

Why does this happen? Because Elasticsearch is a distributed search engine; this way you can make use of multiple nodes/machines to manage big amounts of data.
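The rebalancing shown in the diagrams above can be sketched as a simple round-robin assignment. The real allocator weighs disk usage, allocation filters, and more; this sketch only shows the even spread:

```python
def distribute(shards, nodes):
    """Spread shards over nodes round-robin: one simple sketch of
    the even distribution Elasticsearch aims for."""
    assignment = {node: [] for node in nodes}
    for i, shard in enumerate(shards):
        assignment[nodes[i % len(nodes)]].append(shard)
    return assignment

shards = [1, 2, 3, 4, 5]
print(distribute(shards, ["node-1"]))            # all 5 shards on one node
print(distribute(shards, ["node-1", "node-2"]))  # a 3 + 2 split
```

Which specific shards end up on which node differs from the diagrams (the sketch interleaves them), but the point is the same: adding a node splits the 5 shards roughly evenly.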

Every Elasticsearch index is composed of at least one primary shard, since that's where the data is stored. Every shard comes at a cost, though, so if you have a single node and no foreseeable growth, just stick with a single primary shard.

Another type of shard is a replica. The default number of replicas is 1, meaning that every primary shard will be copied to another shard that contains the same data. Replicas are used to increase search performance and for fail-over. A replica shard is never allocated on the same node as its primary (that would be much like keeping a backup on the same disk as the original data).
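The rule that a replica never shares a node with its primary can be sketched like this (a heavy simplification of the real allocation deciders, but it captures the constraint and its single-node consequence):

```python
def place_replica(primary_node, nodes):
    """Return a node for the replica, or None if it must stay
    unassigned: a replica never shares a node with its primary."""
    for node in nodes:
        if node != primary_node:
            return node
    return None  # single node: replica stays unassigned -> YELLOW

print(place_replica("node-1", ["node-1", "node-2"]))  # node-2
print(place_replica("node-1", ["node-1"]))            # None
```

The `None` case is exactly why a one-node cluster with the default 1 replica sits at YELLOW: the 5 replicas have nowhere legal to go.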

Back to our example, with 1 replica we'll have the whole index on each node, since 3 replica shards will be allocated on the first node and they will contain exactly the same data as the primaries on the second node:

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4R |  | 5R |
|____|  |____|  |____|  |____|  |____|

Same for the second node, which will contain a copy of the primary shards on the first node:

 ____    ____    ____    ____    ____
| 1R |  | 2R |  | 3R |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|

With a setup like this, if a node goes down you still have the whole index. The replica shards will automatically become primaries and the cluster will work properly despite the node failure, as follows:

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|

Since you have "number_of_replicas": 1, the replicas cannot be assigned anymore, as a replica is never allocated on the same node as its primary. That's why you'll have 5 unassigned shards (the replicas), and the cluster status will be YELLOW instead of GREEN. There's no data loss, but things could be better, as some shards cannot be assigned.
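The GREEN/YELLOW distinction described above (plus RED, which the answer doesn't reach because no primary was ever lost) can be sketched as a simple rule over assigned shard counts:

```python
def cluster_health(primaries_assigned, replicas_assigned,
                   num_primaries, num_replicas_total):
    """Simplified sketch of the cluster health rule:
    RED    -> some primary is unassigned (data unavailable)
    YELLOW -> all primaries assigned, some replicas are not
    GREEN  -> every shard copy is assigned."""
    if primaries_assigned < num_primaries:
        return "RED"
    if replicas_assigned < num_replicas_total:
        return "YELLOW"
    return "GREEN"

# Two nodes, 5 primaries + 5 replicas, everything assigned:
print(cluster_health(5, 5, 5, 5))  # GREEN
# One node died: replicas were promoted to primaries, but the new
# replicas cannot be placed on the surviving node:
print(cluster_health(5, 0, 5, 5))  # YELLOW
```

In the scenario above the cluster stays out of RED precisely because the replicas were promoted: all 5 primaries remain assigned throughout.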

As soon as the node that left is back up, it'll join the cluster again and the replicas will be assigned again. The existing shards on the second node can be loaded, but they need to be synchronized with the other shards, since write operations most likely happened while the node was down. At the end of this operation, the cluster status will become GREEN.

Hope this clarifies things for you.
