Creating new table with cqlsh on existing keyspace: Column family ID mismatch

Question

Houston, we have a problem.

Trying to create a new table with cqlsh on an existing Cassandra (v2.1.3) keyspace results in:

ServerError: 
<ErrorMessage code=0000 [Server error] message="java.lang.RuntimeException:
java.util.concurrent.ExecutionException: 
    java.lang.RuntimeException:      
        org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found e8c03790-c952-11e4-a753-5981ea73cd7c; expected e8b14370-c952-11e4-a844-8f10bfb9c386)">

After the first create attempt, trying once more results in:

AlreadyExists: Table 'ks.metrics' already exists

But listing the existing tables of the keyspace with desc tables; does not report the new table.

The issue seems related to Cassandra-8387, except that there is only one client trying to create the table: cqlsh.

We do have a bunch of Spark jobs that create the keyspaces and tables at startup, potentially in parallel. Would this render the keyspace corrupt?

Creating a new keyspace and adding a table to it works as expected.

Any ideas?

UPDATE

Found a workaround: issue a repair on the keyspace, and the tables will appear (desc tables) and are also functional.

Recommended answer

Short answer: They have a race condition, which they think they resolved in 1.1.8...

Long answer:

I get that error all the time on one of my clusters. My test machines have really slow hard drives, and with 4 nodes spread over two separate computers, creating one or two tables is enough to trigger the error.

Below I have a copy of the stack trace from my Cassandra 3.9 installation. Although your version was 2.1.3, I would be surprised if this part of the code had changed much.

As we can see, the exception happens in the validateCompatibility() function. It requires the old and new versions of the metadata to agree on all of the following:

  • ksName (keyspace name)
  • cfName (column family name)
  • cfId (column family UUID)
  • flags (isSuper, isCounter, isDense, isCompound)
  • comparator (key sorting comparator)

If any one of these values does not match between the old and new metadata, the process raises an exception. In our case, the cfId values differ.
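To make the check concrete, here is a minimal sketch of that kind of identity validation. It is a simplified model, not Cassandra's actual CFMetaData code: the TableMeta class is a hypothetical stand-in, and the flags and comparator checks are left out for brevity. The two UUIDs are the ones from the error message in the question.

```java
import java.util.UUID;

// Simplified model of the identity check described above.
// TableMeta is a hypothetical stand-in, not Cassandra's CFMetaData.
class CompatibilitySketch {

    static final class TableMeta {
        final String ksName;
        final String cfName;
        final UUID cfId;

        TableMeta(String ksName, String cfName, UUID cfId) {
            this.ksName = ksName;
            this.cfName = cfName;
            this.cfId = cfId;
        }
    }

    // Throws if the incoming definition does not identify the same table
    // as the definition the node already holds.
    static void validateCompatibility(TableMeta current, TableMeta incoming) {
        if (!incoming.ksName.equals(current.ksName)) {
            throw new IllegalStateException(String.format(
                "Keyspace mismatch (found %s; expected %s)", incoming.ksName, current.ksName));
        }
        if (!incoming.cfName.equals(current.cfName)) {
            throw new IllegalStateException(String.format(
                "Column family mismatch (found %s; expected %s)", incoming.cfName, current.cfName));
        }
        if (!incoming.cfId.equals(current.cfId)) {
            throw new IllegalStateException(String.format(
                "Column family ID mismatch (found %s; expected %s)", incoming.cfId, current.cfId));
        }
    }

    public static void main(String[] args) {
        // Same keyspace and table name, but two different IDs.
        TableMeta current = new TableMeta("ks", "metrics",
            UUID.fromString("e8b14370-c952-11e4-a844-8f10bfb9c386"));
        TableMeta incoming = new TableMeta("ks", "metrics",
            UUID.fromString("e8c03790-c952-11e4-a753-5981ea73cd7c"));
        try {
            validateCompatibility(current, incoming);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that the names match while the IDs do not: both definitions claim to be ks.metrics, yet they carry different identities, so the merge is rejected.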

Going up the stack, we have apply(), which calls validateCompatibility() immediately.

Next we have updateTable(). Similarly, it calls apply() almost immediately. Before that, it calls getCFMetaData() to retrieve the current ("old") column family data that is going to be compared against the new data.

Next we see updateKeyspace(). That function calculates a diff to find out what changed, then applies the changes for each kind of data. Tables come second, after types...

Before that there is mergeSchema(), which calculates what changed at the keyspace level. It then drops the keyspaces that were deleted and generates new keyspaces for those that were updated (and for new keyspaces). Finally, it loops over the new keyspaces, calling updateKeyspace() for each one of them.

Next in the stack we see an interesting function: mergeSchemaAndAnnounceVersion(). This one updates the version once the keyspaces have been updated in memory and on disk. The version of the schema includes the cfId that is not compatible, which is what generates the exception. The Announce part sends a gossip message to the other nodes to tell them that this node now knows about the new version of a certain schema.

Next we see something called MigrationTask. This is the message used to migrate changes between Cassandra nodes. The message payload is a collection of mutations (those handled by the mergeSchema() function).

The rest of the stack just shows run() functions, which are the various functions used to handle messages.

In my case, the problem resolves itself a little later and all is well. I do not have to do anything for the schema to finally get in sync, as expected. However, it prevents me from creating all my tables in one go. So my take on this is that the migration messages do not arrive in the expected order. There must be a timeout that is handled by resending the event, and that generates the mix-up.

So, let's look at the code that sends the message in the first place; you will find it in MigrationManager. There we have a MIGRATION_DELAY_IN_MS parameter linked to an old issue, "Schema push/pull race", which was meant to avoid a race condition. Well... there you go. So they are aware that a race condition is possible, and to try to avoid it they added a little delay there. One part of that fix is a version check: if the versions are already equal, the update is skipped altogether (i.e. that gossip is ignored).

if (Schema.instance.getVersion().equals(currentVersion))
{
    logger.debug("not submitting migration task for {} because our versions match", endpoint);
    return;
}

The delay we are talking about is one minute:

public static final int MIGRATION_DELAY_IN_MS = 60000;

One would think that one whole minute would suffice, but somehow I still get the error all the time.

The fact is that their code does not expect multiple changes happening one right after the other combined with large delays like the ones I have. If I were to create one table and then do other things, I'd be just fine. On the other hand, when I want to create 20 tables in a row on those slow machines, the gossip message from a previous schema change arrives late (i.e. after the new CREATE TABLE command has reached that node). That's when I get the error. The worst part, I guess, is that it is a spurious error: it really means that the gossip arrived late and that the schema in the gossip message is an old one, not that my schema is invalid.
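Since the error shows up when schema changes are issued back to back, one possible mitigation is to wait until all nodes report the same schema version before sending the next CREATE TABLE. The sketch below is only an illustration of that idea under stated assumptions, not something tested in this answer: awaitAgreement and its schemaVersions supplier are hypothetical names, and the supplier is assumed to return the distinct schema versions currently reported by the nodes (for example the schema_version values read from system.local and system.peers).

```java
import java.util.Set;
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical helper: poll until every node reports the same schema version.
class SchemaAgreementSketch {

    // schemaVersions is assumed to return the set of distinct schema versions
    // currently seen across the cluster; one element means the nodes agree.
    // Returns true on agreement, false on timeout or interruption.
    static boolean awaitAgreement(Supplier<Set<UUID>> schemaVersions,
                                  long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (schemaVersions.get().size() == 1) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A script that creates many tables could call awaitAgreement(...) after each CREATE TABLE and only continue once it returns true. How the supplier collects the versions is left open on purpose, since that depends on the client in use.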

org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 122a2d20-9e13-11e6-b830-55bace508971; expected 1213bef0-9e
    at org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:790) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:750) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.config.Schema.updateTable(Schema.java:661) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1350) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1306) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1256) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:92) ~[apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53) [apache-cassandra-3.9.jar:3.9]
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:64) [apache-cassandra-3.9.jar:3.9]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_111]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_111]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
