Cassandra 2.1 system schema missing

Problem description

I have a six-node cluster running Cassandra 2.1.6. Yesterday I tried to drop a column family and received the message "Column family ID mismatch".

I tried running nodetool repair, but after the repair was complete I got the same message. I then tried selecting from the column family but got the message "Column family not found".

I ran the following query to get a list of all column families in my schema:
select columnfamily_name from system.schema_columnfamilies where keyspace_name = 'xxx';
At this point I received the message "Keyspace 'system' not found."

I tried the command describe keyspaces and, sure enough, system was not in the list of keyspaces.

I then tried nodetool resetlocalschema on one of the nodes missing the system keyspace. When that failed to resolve the problem I tried nodetool rebuild, but got the same messages after the rebuild was complete.

I tried stopping the nodes missing the system keyspace and restarting them. Once the restart was completed the system keyspace was back and I was able to execute the above query successfully. However, the table I had tried to drop previously was not listed, so I tried to recreate it and once again received the message "Column family ID mismatch".

Finally, I shut down the cluster and restarted it, and everything works as expected.

My questions are:
How/why did the system keyspace disappear?
What happened to the data being inserted into my column families while the system keyspace was missing from two of the six nodes? (my application didn't seem to have any problems)
Is there a way I can detect problems like this automatically or do I have to manually check up on my keyspaces each day?
Is there a way to fix the missing system keyspace and/or the Column family ID mismatch without restarting the entire cluster?


EDIT
As per Jim Meyers' suggestion I queried the cf_id on each node of the cluster and confirmed that all nodes return the same value.


select cf_id from system.schema_columnfamilies where columnfamily_name = 'customer' allow filtering;

cf_id
--------------------------------------
cbb51b40-2b75-11e5-a578-798867d9971f

I then ran ls on my data directory and can see that there are multiple entries for a few of my tables:
customer-72bc62d0ff7611e4a5b53386c3f1c9f9
customer-cbb51b402b7511e5a578798867d9971f

My application dynamically creates tables at run time (always using IF NOT EXISTS). It seems likely that the application issued the same create table command on separate nodes at the same time, resulting in the schema mismatch. Since I restarted the cluster, everything seems to be working fine.

Is it safe to delete the extra file?
i.e. customer-72bc62d0ff7611e4a5b53386c3f1c9f9



Recommended answer

1 The cause of this problem is a CREATE TABLE statement collision. Do not generate tables dynamically from multiple clients, even with IF NOT EXISTS. The first thing you need to do is fix your code so that this does not happen. Just create your tables manually from cqlsh, allowing time for the schema to settle. Always wait for schema agreement when modifying the schema.

2 Here is how to fix it:

1) Change your code to not automatically re-create tables (even with IF NOT EXISTS).

2) Run a rolling restart to ensure the schema matches across nodes. Run nodetool describecluster on the nodes of your cluster and check that there is only one schema version.

On each node:

3) Check your filesystem and see if you have two directories for the table in question in the data directory.

If there are two or more directories:

4) Identify from system.schema_columnfamilies which cf ID is the "new" one (currently in use).

cqlsh -e "select * from system.schema_columnfamilies" | grep <table name>

5) Move the data from the "old" one to the "new" one and remove the old directory. 

6) If there are multiple "old" ones, repeat step 5 for every "old" directory.

7) Run nodetool refresh for the table (the command takes the keyspace and table names as arguments).
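Steps 4 to 6 depend on telling the "new" directory apart from the "old" ones. The sketch below is one way to do that, assuming (as the listing in the question suggests) that each directory is named after the table followed by the cf_id with its hyphens removed. The function names are hypothetical and the code only classifies names; it does not move or delete anything.

```python
import uuid


def cf_id_to_suffix(cf_id):
    """Return the cf_id as 32 hex characters without hyphens."""
    return uuid.UUID(cf_id).hex


def classify_table_dirs(table, current_cf_id, dir_names):
    """Split the directory names of one table into (current, old).

    Assumption: directories are named "<table>-<cf_id without hyphens>".
    Entries that belong to other tables are ignored.
    """
    wanted = cf_id_to_suffix(current_cf_id)
    current, old = [], []
    for entry in dir_names:
        name, sep, suffix = entry.rpartition("-")
        if not sep or name != table:
            continue  # not a directory of this table
        if suffix == wanted:
            current.append(entry)
        else:
            old.append(entry)
    return current, old
```

With the values from the question, classify_table_dirs("customer", "cbb51b40-2b75-11e5-a578-798867d9971f", names) puts customer-cbb51b402b7511e5a578798867d9971f in the current list and customer-72bc62d0ff7611e4a5b53386c3f1c9f9 in the old list.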

If there is only one directory:

No further action is needed.
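The check from step 2 (only one schema version) can also be scripted, which may help with detecting this kind of problem automatically. The sketch below is a rough illustration, assuming the nodetool describecluster output contains a "Schema versions:" section with one line per version in the form "<uuid>: [address, address]" and an "UNREACHABLE" entry for nodes that did not answer. That layout and the function names are assumptions; verify them against the output of your own cluster.

```python
import re

# Assumed line layout: "<schema version uuid>: [addr1, addr2]" or "UNREACHABLE: [addr]"
_VERSION_LINE = re.compile(r"^\s*([0-9a-fA-F-]{36}|UNREACHABLE):\s*\[(.*)\]\s*$")


def schema_versions(describecluster_output):
    """Map each schema version (or 'UNREACHABLE') to its list of node addresses."""
    versions = {}
    in_section = False
    for line in describecluster_output.splitlines():
        if line.strip().startswith("Schema versions"):
            in_section = True
            continue
        if not in_section:
            continue
        match = _VERSION_LINE.match(line)
        if match:
            nodes = [n.strip() for n in match.group(2).split(",") if n.strip()]
            versions[match.group(1)] = nodes
    return versions


def schema_agrees(describecluster_output):
    """True only if exactly one schema version is reported and no node is unreachable."""
    versions = schema_versions(describecluster_output)
    reachable = [v for v in versions if v != "UNREACHABLE"]
    return len(reachable) == 1 and "UNREACHABLE" not in versions
```

The text passed in would be the captured output of nodetool describecluster. A result of False means either a schema disagreement or an unreachable node, and both are worth investigating before changing the schema.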

Schema collisions will continue to be an issue until CASSANDRA-9424 is resolved.

Here's an example of it being reported on Jira and closed as "not a problem": CASSANDRA-8387
