Cassandra host in cluster with null ID


Question

Note: We are seeing this issue in our Cassandra 2.1.12.1047 (DSE 4.8.4) cluster with 6 nodes across 3 regions (2 in each region).

Trying to update schemas on our cluster recently, we found the updates were failing. We suspected one node in the cluster was not accepting the change.

When checking the system.peers table on one of our servers in us-east-1, we noticed an anomaly: it had what seemed to be a complete entry for a host that does not exist.

cassandra@cqlsh> SELECT peer, host_id FROM system.peers WHERE peer IN ('54.158.22.187', '54.196.90.253');

 peer          | host_id
---------------+--------------------------------------
 54.158.22.187 | 8ebb7f2c-8f81-44af-814b-a537b84834e0
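
To see whether other nodes are carrying the same stale entry, the full peers list can be dumped on each server. A minimal sketch (the columns are from the standard system.peers schema, and cqlsh -e simply executes the statement and exits):

# Run on each node; a stale row for a long-gone host will still show up here
$ cqlsh -e "SELECT peer, data_center, rack, host_id FROM system.peers;"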

As that host did not exist, I tried to remove it using nodetool removenode, but that failed with:

error: Cannot remove self
-- StackTrace --
java.lang.UnsupportedOperationException: Cannot remove self
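
For reference, nodetool removenode addresses nodes by Host ID rather than by IP, which is presumably why a peer whose host_id is null cannot be removed this way. A sketch of the normal usage, reusing the UUID from the system.peers output above:

# Remove a dead node by the Host ID shown in nodetool status / system.peers
$ nodetool removenode 8ebb7f2c-8f81-44af-814b-a537b84834e0

# Check on a removal that is still streaming data
$ nodetool removenode status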

We know that the .187 server was abruptly terminated a few weeks ago due to an EC2 issue.

We made numerous attempts to get the server healthy again, but in the end we simply terminated the server that was reporting this .187 host in its system.peers table, ran nodetool removenode from one of the other servers, and then brought a new server online.

The new server came online, and within an hour or so seemed to have caught up on the backlog of activity needed to bring it in line with the other servers (an assumption based purely on CPU monitoring).
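
Rather than inferring catch-up from CPU alone, the streaming and compaction backlog can be checked directly. A sketch using standard nodetool subcommands:

$ nodetool netstats          # active/pending streams to and from this node
$ nodetool compactionstats   # compactions still waiting to run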

However, things are now very odd, because the .187 host that was reported in the system.peers tables is appearing when we run nodetool status from any server in the cluster other than the new one we just brought online.

$ nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID                               Rack
DN  54.158.22.187   ?          256     ?       null                                  r1
Datacenter: cassandra-ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID                               Rack
UN  54.255.xx.xx    7.9 GB     256     ?       a0c45f3f-8479-4046-b3c0-b2dd19f07b87  ap-southeast-1a
UN  54.255.xx.xx    8.2 GB     256     ?       b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf  ap-southeast-1b
Datacenter: cassandra-eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID                               Rack
UN  176.34.xx.xxx   8.51 GB    256     ?       30ff8d00-1ab6-4538-9c67-a49e9ad34672  eu-west-1b
UN  54.195.xx.xxx   8.4 GB     256     ?       f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7  eu-west-1c
Datacenter: cassandra-us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns    Host ID                               Rack
UN  54.225.xx.xxx   8.17 GB    256     ?       0e0adf3d-4666-4aa4-ada7-4716e7c49ace  us-east-1e
UN  54.224.xx.xxx   3.66 GB    256     ?       1f9c6bef-e479-49e8-a1ea-b1d0d68257c7  us-east-1d

What can I do to get rid of this rogue node?

Note: Here is the result from a nodetool describecluster:

$ nodetool describecluster
Cluster Information:
  Name: XXX
  Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
  Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
  Schema versions:
    d140bc9b-134c-3dbe-929f-7a84c2cd4532: [54.255.17.28, 176.34.207.151, 54.225.11.249, 54.195.174.72, 54.224.182.94, 54.255.64.1]

    UNREACHABLE: [54.158.22.187]
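
The gossip state that keeps resurrecting the ghost endpoint can also be inspected directly. A sketch (nodetool gossipinfo lists every endpoint the node holds gossip records for, including dead ones):

# Look for the .187 endpoint in the gossip records of each live node
$ nodetool gossipinfo | grep -A 5 '54.158.22.187'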


Answer

I've never had to do this myself, but probably the only thing left for you to do is to assassinate the endpoint. This was made into a nodetool command (nodetool assassinate) in Cassandra 2.2. But prior to that version, the only way to do it is via JMX. Here's a Gist with detailed instructions (instructions and code by Justen Walker).
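
For anyone on 2.2 or later, the nodetool equivalent is a one-liner; a sketch using the rogue IP from the question:

# Cassandra 2.2+ only: force-remove the dead endpoint from gossip
$ nodetool assassinate 54.158.22.187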


Prerequisites


1. Log onto an existing live node in the cluster.

2. Download JMX Term, with either wget:

$ wget -q -O jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar

or curl:

$ curl -s -o jmxterm.jar http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar

3. Run jmxterm:

$ java -jar ./jmxterm.jar
Welcome to JMX terminal. Type "help" for available commands.
$>

Assassinate node

Example bad node: 10.0.0.100


  • 连接到本地群集

  • Gossiper MBean使用错误节点的ip运行
    unsafeAssassinateEndpoint



$>open localhost:7199
#Connection to localhost:7199 is opened

$>bean org.apache.cassandra.net:type=Gossiper
#bean is set to org.apache.cassandra.net:type=Gossiper

$>run unsafeAssassinateEndpoint 10.0.0.100
#calling operation unsafeAssassinateEndpoint of mbean org.apache.cassandra.net:type=Gossiper
#operation returns: null

$>quit
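
The interactive session above can also be scripted. A sketch, assuming jmxterm's -l (connection URL) and -n (non-interactive) options behave as in the 1.0-alpha-4 build downloaded earlier:

# Pipe the command into jmxterm non-interactively instead of typing it
$ echo "run -b org.apache.cassandra.net:type=Gossiper unsafeAssassinateEndpoint 10.0.0.100" \
    | java -jar jmxterm.jar -l localhost:7199 -n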

Update 20160308:


"I've never had to do this myself"

Just had to do this myself. Totally looked up and followed the steps in my own answer, too.
