Event de-duplication using Cassandra


Question

I'm looking for the best way to de-duplicate events using Cassandra.

I have many clients receiving event id's (thousands per second). I need to ensure that each event id is processed once and only once with high reliability and high availability.

So far I've tried two methods:


  1. Use the event id as a partition key, and do an "INSERT ... IF NOT EXISTS". If that fails, then the event is a duplicate and can be dropped. This is a nice clean approach, but the throughput is not great due to Paxos, especially with higher replication factors such as 3. It's also fragile, since IF NOT EXISTS always requires a quorum to work and there's no way to back down to a lower consistency if a quorum isn't available. So a couple of down nodes will completely block some event id's from being processed.

  2. Allow clients to collide on the same event id, but then detect the collision using a clustering column. So insert using the event id as a partition key, and a client generated timeuuid as a clustering column. The client will then wait a while (in case other clients are inserting the same partition key) and then do a read of the event id with limit 1, to return the oldest clustered row. If the timeuuid it reads back matches what it inserted, then it is the "winner" and processes the event. If the timeuuid does not match, then it is a duplicate and can be dropped.
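The two approaches above can be sketched in CQL. The table and column names here are hypothetical, chosen just to illustrate the shape of each design:

```cql
-- Approach 1 (hypothetical schema): a lightweight-transaction insert.
-- Cassandra returns [applied] = false on a duplicate, at the cost of a
-- Paxos round per insert, and it requires a quorum regardless of the
-- requested consistency level.
CREATE TABLE events_seen (
    event_id text PRIMARY KEY
);

INSERT INTO events_seen (event_id) VALUES ('evt-123') IF NOT EXISTS;

-- Approach 2 (hypothetical schema): a plain insert plus a delayed
-- read-back. All claims for an event land in one partition, clustered
-- oldest-first by a client-generated timeuuid.
CREATE TABLE event_claims (
    event_id  text,
    claim_ts  timeuuid,
    client_id text,
    PRIMARY KEY (event_id, claim_ts)
) WITH CLUSTERING ORDER BY (claim_ts ASC);

INSERT INTO event_claims (event_id, claim_ts, client_id)
VALUES ('evt-123', now(), 'client-A');

-- After waiting a while, read the oldest clustered row; the client
-- whose timeuuid comes back is the "winner" for this event id.
SELECT claim_ts, client_id
FROM event_claims
WHERE event_id = 'evt-123'
LIMIT 1;
```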

The collision (baker's algorithm) approach has much better throughput and availability than using IF NOT EXISTS, but it's more complex and feels more risky. For example if the system clock on a client is out of whack, then a duplicate event would look like a non-duplicate. All my client and Cass nodes use NTP, but that's not always perfect at synchronizing clocks.

Anyone have a suggestion for which approach to use? Is there another way to do this?

Also note that my cluster will be set up with three data centers with about 100 ms latency between DC's.

Thanks.

Answer

I think that of all the proposed solutions, your second one is the best. But instead of storing only the oldest value per clustered column, I would store all events, keeping the history ordered from oldest to newest (when inserting you don't have to check whether a row already exists, whether it is the oldest, etc.; you can then select the one with the oldest writetime attribute). Then I would select the oldest for processing, as you wrote. Since Cassandra sees no difference between an insert and an upsert, I don't see any alternative within Cassandra itself, or as someone said - do this outside of Cassandra.
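This variant can be sketched in CQL as well. The schema below is a hypothetical illustration of the answer's suggestion: every client appends its claim unconditionally (no read-before-write, no lightweight transaction), and the processing side treats the first clustered row as the winner:

```cql
-- Hypothetical layout: append every claim; duplicates simply become
-- extra clustered rows in the same partition.
CREATE TABLE event_log (
    event_id  text,
    claim_ts  timeuuid,
    client_id text,
    PRIMARY KEY (event_id, claim_ts)
) WITH CLUSTERING ORDER BY (claim_ts ASC);

-- Every client just inserts; no existence check is needed.
INSERT INTO event_log (event_id, claim_ts, client_id)
VALUES ('evt-123', now(), 'client-B');

-- Processing side: the oldest clustered row is the winner;
-- any later rows for the same event_id are duplicates.
SELECT client_id
FROM event_log
WHERE event_id = 'evt-123'
LIMIT 1;
```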

