LOCAL_ONE and unexpected data replication with Cassandra

Question

FYI. We are running this test with Cassandra 2.1.12.1047 | DSE 4.8.4

We have a simple table in Cassandra that has 5,000 rows of data in it. Some time back, as a precaution, we added monitoring on each Cassandra instance to ensure that it has 5,000 rows of data, because our replication factor enforces this, i.e. we have 2 replicas in every region and we have 6 servers in total in our dev cluster.

CREATE KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'ap-southeast-1-A': '2', 'eu-west-1-A': '2', 'us-east-1-A': '2'} AND durable_writes = true;

We recently forcibly terminated a server to simulate a failure and brought a new one online to see what would happen. We also removed the old node using nodetool removenode so that in each region we expected all data to exist on every server.
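
For reference, the removal itself was just the standard command; HOST-ID-OF-DEAD-NODE below is a placeholder for whatever nodetool status reported for the terminated node:

$ nodetool status my_keyspace               # the terminated node shows up as DN along with its Host ID
$ nodetool removenode HOST-ID-OF-DEAD-NODE  # run from one of the live nodes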

Once the new server came online, it joined the cluster and seemingly started replicating the data. We assumed that, because it was in bootstrap mode, it would be responsible for ensuring it got the data it needed from the cluster. CPU finally dropped after around an hour, and we assumed the replication was complete.

However, our monitors, which intentionally do queries using LOCAL_ONE on each server, reported that while all the existing servers had 5,000 rows, the new server that was brought online was stuck at around 2,600 rows. We assumed that perhaps it was still replicating, so we left it a while, but it stayed at that number.

So we ran nodetool status to check and got the following:

$ nodetool status my_keyspace
Datacenter: ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.255.17.28    7.9 GB     256     100.0%            a0c45f3f-8479-4046-b3c0-b2dd19f07b87  ap-southeast-1a
UN  54.255.64.1     8.2 GB     256     100.0%            b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf  ap-southeast-1b
Datacenter: eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  176.34.207.151  8.51 GB    256     100.0%            30ff8d00-1ab6-4538-9c67-a49e9ad34672  eu-west-1b
UN  54.195.174.72   8.4 GB     256     100.0%            f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7  eu-west-1c
Datacenter: us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.225.11.249   8.17 GB    256     100.0%            0e0adf3d-4666-4aa4-ada7-4716e7c49ace  us-east-1e
UN  54.224.182.94   3.66 GB    256     100.0%            1f9c6bef-e479-49e8-a1ea-b1d0d68257c7  us-east-1d 

So if the server is reporting that it owns 100% of the data, why is the LOCAL_ONE query only giving us roughly half the data?

When I did run a LOCAL_QUORUM query, it returned 5,000 rows, and from that point forward it returned 5,000 even for LOCAL_ONE queries.

Whilst LOCAL_QUORUM solved the problem in this instance, we may in future need to do other types of queries on the assumption that each server a) has the data it should have, b) knows how to satisfy queries when it does not have the data i.e. it knows that data sits somewhere else on the ring.

Further update 24 hours later - the problem is multi-faceted

So, in the absence of any feedback on this issue, I have proceeded to experiment with this on the cluster by adding more nodes. According to https://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html, I have followed all the steps recommended to add nodes to the cluster and, in effect, add capacity. I believe the premise of Cassandra is that as you add nodes, it is the cluster's responsibility to rebalance the data, and during that time to serve each read from wherever the data currently sits on the ring if it is not yet where it should be.
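
While a node is joining, the streaming that bootstrap performs can be watched with standard nodetool commands, which is worth doing rather than inferring completion from CPU alone:

$ nodetool netstats            # on the joining node: lists the bootstrap streaming sessions and their progress
$ nodetool status my_keyspace  # the new node shows as UJ (Up/Joining) until bootstrap completes, then UN (Up/Normal)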

Unfortunately that is not the case at all. Here is my new ring:

Datacenter: ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.255.xxx.xxx  8.06 GB    256     50.8%             a0c45f3f-8479-4046-b3c0-b2dd19f07b87  ap-southeast-1a
UN  54.254.xxx.xxx  2.04 MB    256     49.2%             e2e2fa97-80a0-4768-a2aa-2b63e2ab1577  ap-southeast-1a
UN  54.169.xxx.xxx  1.88 MB    256     47.4%             bcfc2ff0-67ab-4e6e-9b18-77b87f6b3df3  ap-southeast-1b
UN  54.255.xxx.xxx  8.29 GB    256     52.6%             b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf  ap-southeast-1b
Datacenter: eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.78.xxx.xxx   8.3 GB     256     49.9%             30ff8d00-1ab6-4538-9c67-a49e9ad34672  eu-west-1b
UN  54.195.xxx.xxx  8.54 GB    256     50.7%             f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7  eu-west-1c
UN  54.194.xxx.xxx  5.3 MB     256     49.3%             3789e2cc-032d-4b26-bff9-b2ee71ee41a0  eu-west-1c
UN  54.229.xxx.xxx  5.2 MB     256     50.1%             34811c15-de8f-4b12-98e7-0b4721e7ddfa  eu-west-1b
Datacenter: us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.152.xxx.xxx  5.27 MB    256     47.4%             a562226a-c9f2-474f-9b86-46c3d2d3b212  us-east-1d
UN  54.225.xxx.xxx  8.32 GB    256     50.3%             0e0adf3d-4666-4aa4-ada7-4716e7c49ace  us-east-1e
UN  52.91.xxx.xxx   5.28 MB    256     49.7%             524320ba-b8be-494a-a9ce-c44c90555c51  us-east-1e
UN  54.224.xxx.xxx  3.85 GB    256     52.6%             1f9c6bef-e479-49e8-a1ea-b1d0d68257c7  us-east-1d

As you will see, I have doubled the size of the ring and the effective ownership is roughly 50% per server, as expected (my replication factor is 2 copies in every region). However, worryingly, you can see that some servers have absolutely no load on them (they are new), whilst others have excessive load on them (they are old and clearly no redistribution of data has occurred).

Now this in itself is not the worry as I believe in the powers of Cassandra and its ability to eventually get the data in the right place. The thing that worries me immensely is that my table with exactly 5,000 rows now no longer has 5,000 rows in any of my three regions.

# From ap-southeast-1

cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  3891

cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  4633


# From eu-west-1

cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  1975

cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  4209


# From us-east-1

cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  4435

cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  4870



So seriously, what is going on here? Let's recap:


  • my replication factor is 'ap-southeast-1-A': '2', 'eu-west-1-A': '2', 'us-east-1-A': '2' so every region should be able to satisfy a query in full.
  • Bringing on new instances should not cause me to have data loss, yet apparently we do even with LOCAL_QUORUM
  • Every region has a different view on the data yet I have not introduced any new data, only new servers that then bootstrap automatically.

So then I thought, why not do a QUORUM query across the entire multi-region cluster? Unfortunately that fails completely:

cqlsh> CONSISTENCY QUORUM;
Consistency level set to QUORUM.

cqlsh> select count(*) from health_check_data_consistency;
OperationTimedOut: errors={}, last_host=172.17.0.2

I then turned TRACING ON; and that failed too. All I can see in the logs is the following:

INFO  [SlabPoolCleaner] 2016-03-03 19:16:16,616  ColumnFamilyStore.java:1197 - Flushing largest CFS(Keyspace='system_traces', ColumnFamily='events') to free up room. Used total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this: 0.02/0.02
INFO  [SlabPoolCleaner] 2016-03-03 19:16:16,617  ColumnFamilyStore.java:905 - Enqueuing flush of events: 5624218 (2%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:1126] 2016-03-03 19:16:16,617  Memtable.java:347 - Writing Memtable-events@732346653(1.102MiB serialized bytes, 25630 ops, 2%/0% of on/off-heap limit)
INFO  [MemtableFlushWriter:1126] 2016-03-03 19:16:16,821  Memtable.java:382 - Completed flushing /var/lib/cassandra/data/system_traces/events/system_traces-events-tmp-ka-3-Data.db (298.327KiB) for commitlog position ReplayPosition(segmentId=1456854950580, position=28100666
)
INFO  [ScheduledTasks:1] 2016-03-03 19:16:21,210  MessagingService.java:929 - _TRACE messages were dropped in last 5000 ms: 212 for internal timeout and 0 for cross node timeout

This happens on every single server I run the query on.
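
It may simply be that a count(*) spanning three regions is too slow for cqlsh's default client-side timeout rather than anything more sinister. If I am reading the 2.1-era cqlsh behaviour correctly, that timeout can be raised in ~/.cassandra/cqlshrc before retrying; treat the setting name below as an assumption on my part:

# ~/.cassandra/cqlshrc
[connection]
client_timeout = 120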

Checking the cluster, it seems everything is in sync

$ nodetool describecluster;
Cluster Information:
    Name: Ably
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
            51e57d47-8870-31ca-a2cd-3d854e449687: [54.78.xxx.xxx, 54.152.xxx.xxx, 54.254.xxx.xxx, 54.255.xxx.xxx, 54.195.xxx.xxx, 54.194.xxx.xxx, 54.225.xxx.xxx, 52.91.xxx.xxx, 54.229.xxx.xxx, 54.169.xxx.xxx, 54.224.xxx.xxx, 54.255.xxx.xxx]

Further update 1 hour later

Someone suggested that perhaps this was simply down to range queries not working as expected. I thus wrote a simple script that queried for each of the 5,000 rows individually (they have an ID range of 1 to 5,000). Unfortunately the results are as I feared: I have missing data. I have tried this with LOCAL_ONE, LOCAL_QUORUM and even QUORUM.

ruby> (1..5000).each { |id| puts "#{id} missing" if session.execute("select id from health_check_data_consistency where id = #{id}", consistency: :local_quorum).length == 0 }
19 missing, 61 missing, 84 missing, 153 missing, 157 missing, 178 missing, 248 missing, 258 missing, 323 missing, 354 missing, 385 missing, 516 missing, 538 missing, 676 missing, 708 missing, 727 missing, 731 missing, 761 missing, 863 missing, 956 missing, 1006 missing, 1102 missing, 1121 missing, 1161 missing, 1369 missing, 1407 missing, 1412 missing, 1500 missing, 1529 missing, 1597 missing, 1861 missing, 1907 missing, 2005 missing, 2168 missing, 2207 missing, 2210 missing, 2275 missing, 2281 missing, 2379 missing, 2410 missing, 2469 missing, 2672 missing, 2726 missing, 2757 missing, 2815 missing, 2877 missing, 2967 missing, 3049 missing, 3070 missing, 3123 missing, 3161 missing, 3235 missing, 3343 missing, 3529 missing, 3533 missing, 3830 missing, 4016 missing, 4030 missing, 4084 missing, 4118 missing, 4217 missing, 4225 missing, 4260 missing, 4292 missing, 4313 missing, 4337 missing, 4399 missing, 4596 missing, 4632 missing, 4709 missing, 4786 missing, 4886 missing, 4934 missing, 4938 missing, 4942 missing, 5000 missing

As you can see from above, that means I have roughly 1.5% of my data no longer available.

So I am stumped. I really need some advice here because I was certainly under the impression that Cassandra was specifically designed to handle scaling out horizontally on demand. Any help greatly appreciated.

Answer

What I should have said is that you can't guarantee consistency AND availability, since your QUORUM query is essentially an ALL query: with a replication factor of 2 in each region, a quorum of replicas is all of them. The only way to query when one of the nodes is down would be to lower the CL, and that won't do a read repair if the data on the available node is inconsistent.
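
To put numbers on that, the quorum size is derived from the replication factor, so with 2 replicas per region a local quorum is every local replica:

quorum = floor(sum_of_replication_factors / 2) + 1

LOCAL_QUORUM with RF 2 in the local DC:   floor(2 / 2) + 1 = 2   (both local replicas, i.e. effectively ALL)
QUORUM across all three DCs (RF sum 6):   floor(6 / 2) + 1 = 4   (must reach replicas in remote regions too)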

After running repair you also need to run cleanup on the old nodes to remove the data they no longer own. Also, repair won't remove deleted/TTLd data until after the gc_grace_seconds period. So if you have any of that, it'll stick around for at least gc_grace_seconds.
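
A minimal sketch of that sequence, assuming the keyspace from your question and running it node by node:

$ nodetool repair -pr my_keyspace   # on every node in turn; -pr repairs each token range exactly once across the cluster
$ nodetool cleanup my_keyspace      # then on the pre-existing nodes, to drop data for ranges they no longer own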

Did you find anything in the logs? Can you share your configuration?
