elasticsearch: Did I lose data when two of my three nodes went down?



Elasticsearch 1.7.2 on CentOS

The question: When my nodes B and C went down, did I lose data?

3 node cluster: Nodes: A, B, C

A is master (it was set up first, and it worked out that way). Relevant config below (the same on all nodes; what actually happened was that B lost network access and went down, and it turned out that C was incorrectly set to number_of_replicas: 1)

node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 2

On A, while those other two nodes were down, I noticed that "unassigned_shards" is 6. Since my shard count is 5, that implies to me that I have a problem:

# curl -XGET http://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "elasticsearch-PROD-prod",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 4,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 6,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
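The count of 6 is consistent with the index effectively having one replica per shard (as the _cat/shards output further down suggests, with a single "r" row per shard) rather than the 2 replicas in the config. A quick sanity check of that arithmetic, under that assumption:

```python
# Sanity-check the unassigned_shards figure from the health output above.
# Assumption: the index was effectively created with number_of_replicas: 1,
# matching node C's (incorrect) setting, not the intended 2.
primaries = 5
replicas_per_primary = 1
total_shard_copies = primaries * (1 + replicas_per_primary)  # 10 copies total
active = 4  # primaries still assigned on the surviving node A
unassigned = total_shard_copies - active
print(unassigned)  # 6: one missing primary plus 5 replicas with no node to host them
```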

Sure enough, on the shard list below, there is a primary shard (#1) that is UNASSIGNED

# curl -XGET http://localhost:9200/_cat/shards
index_v3_PROD 4 p STARTED    22578283 12.7gb 10.208.131.56 PROD-node-3a
index_v3_PROD 4 r UNASSIGNED                                   
index_v3_PROD 0 p STARTED    22572884 12.7gb 10.208.131.56 PROD-node-3a
index_v3_PROD 0 r UNASSIGNED                                   
index_v3_PROD 3 p STARTED    22579159 12.8gb 10.208.131.56 PROD-node-3a
index_v3_PROD 3 r UNASSIGNED                                   
index_v3_PROD 1 p UNASSIGNED                                   
index_v3_PROD 1 r UNASSIGNED                                   
index_v3_PROD 2 p STARTED    22580877 12.7gb 10.208.131.56 PROD-node-3a
index_v3_PROD 2 r UNASSIGNED                                   

Notice above that shard 1 is "p" and is UNASSIGNED. This looks scary to me!
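An unassigned primary is the dangerous case, so it is worth filtering for it directly rather than eyeballing the list. A small sketch that parses the _cat/shards text above (in practice you would feed it the output of `curl -s localhost:9200/_cat/shards`):

```python
# Flag unassigned PRIMARY shards in _cat/shards output.
# The sample text is the output pasted above.
cat_shards = """\
index_v3_PROD 4 p STARTED    22578283 12.7gb 10.208.131.56 PROD-node-3a
index_v3_PROD 4 r UNASSIGNED
index_v3_PROD 0 p STARTED    22572884 12.7gb 10.208.131.56 PROD-node-3a
index_v3_PROD 0 r UNASSIGNED
index_v3_PROD 3 p STARTED    22579159 12.8gb 10.208.131.56 PROD-node-3a
index_v3_PROD 3 r UNASSIGNED
index_v3_PROD 1 p UNASSIGNED
index_v3_PROD 1 r UNASSIGNED
index_v3_PROD 2 p STARTED    22580877 12.7gb 10.208.131.56 PROD-node-3a
index_v3_PROD 2 r UNASSIGNED
"""

unassigned_primaries = []
for line in cat_shards.splitlines():
    parts = line.split()  # columns: index, shard, prirep, state, ...
    if parts[2] == "p" and parts[3] == "UNASSIGNED":
        unassigned_primaries.append(parts[1])

print(unassigned_primaries)  # ['1'] -- no copy of shard 1 is assigned anywhere
```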

I then used a reroute command to assign it to A, which worked.

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands" : [ {
              "allocate" : {
                  "index" : "index_v3_PROD", 
                  "shard" : 1, 
                  "node" : "PROD-node-3a", 
                  "allow_primary" : true
              }
            }
        ]
    }'
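For reference: in Elasticsearch 5.x and later, this form of the command was split into `allocate_stale_primary` and `allocate_empty_primary`, and forcing an empty primary requires an explicit `accept_data_loss` flag, which makes the consequence of this operation much harder to miss. The equivalent request body (same index/shard/node as above) would look roughly like:

```json
{
  "commands": [{
    "allocate_empty_primary": {
      "index": "index_v3_PROD",
      "shard": 1,
      "node": "PROD-node-3a",
      "accept_data_loss": true
    }
  }]
}
```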

But shard 1 started at a very small size and then kind of grew (I think from new data being sent to ES). I have a strong feeling that shard 1 data was lost.

Can someone confirm whether shard 1 data looks suspect/lost (or not)?

Solution

Andrei posted this as a comment, not as an answer, so I will:

Yes, data was lost.

A reroute command with "allow_primary": true for a primary shard that is not available anywhere will start that shard from scratch, empty. You should have done everything possible to bring that node back into the cluster first.


The footnote here is: We did not know what to do to bring back a cluster node. The other 2 nodes showed this status:

# curl -XGET http://localhost:9200/_cluster/health?pretty=true
{
  "error" : "MasterNotDiscoveredException[waited for [30s]]",
  "status" : 503
}
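With hindsight, a common cause of MasterNotDiscoveredException in ES 1.x is zen discovery misconfiguration: the nodes cannot find the other master-eligible nodes to elect a master. A typical unicast setup for a 3-node cluster looks like the sketch below (the host names are placeholders, not from the original setup); `minimum_master_nodes: 2` enforces a quorum of the 3 master-eligible nodes and also guards against split-brain:

```
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["nodeA-host", "nodeB-host", "nodeC-host"]
discovery.zen.minimum_master_nodes: 2
```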

We could not find any diagnostic steps to use, and the logs showed us nothing useful. Both of the other nodes reported this issue (there are 3 total, so the only copy of the needed shard was on one of those two).

We verified all-way network communication, and rebooted the other nodes, but they did not attach to the cluster.

Ultimately, we set up a fresh 3-node cluster, and we are being better about ensuring that every node has a copy of every shard, so that the cluster can withstand losing 2 nodes.
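The survivability arithmetic behind that last decision can be sketched as follows (a rule of thumb, stated as an assumption: Elasticsearch never places two copies of the same shard on one node, so copies per shard are capped by the node count):

```python
def survives(total_nodes: int, replicas: int, nodes_lost: int) -> bool:
    """Can every shard still have at least one live copy after losing nodes?"""
    # 1 primary + `replicas` copies, each on a distinct node, capped by node count.
    copies = min(1 + replicas, total_nodes)
    # Worst case: every lost node held a copy of some particular shard.
    return copies > nodes_lost

print(survives(3, 1, 2))  # False: the original setup (effective replicas = 1)
print(survives(3, 2, 2))  # True: every node holds every shard
```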
