Why does a Cassandra cluster need synchronized clocks between nodes?


Question

The DataStax introductory course on Cassandra says that the clocks of all nodes in a Cassandra cluster have to be synchronized in order to prevent READ queries from returning 'old' data.

If one or more nodes are down they cannot receive updates, but as soon as they come back up they would catch up, so there should be no problem...

So, why does a Cassandra cluster need synchronized clocks between nodes?

Recommended answer

In general it is always a good idea to keep your server clocks in sync, but the primary reason clock sync is needed between nodes is that Cassandra uses a concept called 'Last Write Wins' to resolve conflicts and determine which mutation represents the most up-to-date state of the data. This is explained in 'Why Cassandra Doesn't Need Vector Clocks'.

Whenever you 'mutate' (write or delete) one or more columns in Cassandra, the coordinator handling your request assigns a timestamp. That timestamp is written together with the column value in a cell.

When a read request occurs, Cassandra builds your results by finding the mutations that match your query criteria, and when it sees multiple cells representing the same column it picks the one with the most recent timestamp. (The read path is more involved than this, but that is all you need to know in this context.)
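The selection rule described above can be sketched in a few lines of Python. This is only an illustration of 'Last Write Wins', not Cassandra's actual read path: the `Cell` class and `resolve` function are hypothetical names, a value of `None` stands in for a delete, and tie-breaking between equal timestamps is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for one stored version of a column."""
    timestamp: int          # write timestamp assigned when the mutation was made
    value: Optional[str]    # None represents a delete marker


def resolve(cells: List[Cell]) -> Optional[str]:
    """Return the value of the cell with the highest timestamp (Last Write Wins)."""
    if not cells:
        return None
    winner = max(cells, key=lambda cell: cell.timestamp)
    return winner.value


# Three versions of the same column, in arbitrary storage order.
cells = [Cell(100, "old"), Cell(250, "new"), Cell(180, None)]
print(resolve(cells))  # new
```

Note that the winner depends only on the timestamps, not on the order in which the writes actually arrived.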

Things start to become problematic when your nodes' clocks drift out of sync. As mentioned, the coordinator node handling your request assigns the timestamp. If you make multiple mutations to the same column and different coordinators handle them, you can end up in situations where a write that happened earlier is returned instead of the most recent one.

Here is a basic scenario that illustrates this:

Assume we have a 2-node cluster with nodes A and B. Let's assume an initial state where A's clock reads t10 and B's clock reads t5.


  1. A user executes DELETE C FROM tbl WHERE key=5. Node A coordinates the request and assigns it timestamp t10.
  2. A second passes and a user executes UPDATE tbl SET C='data' WHERE key=5. Node B coordinates the request and assigns it timestamp t6.
  3. A user executes the query SELECT C FROM tbl WHERE key=5. Because the DELETE from step 1 has a more recent timestamp (t10 > t6), no results are returned.
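The three steps above can be replayed with a small simulation. It is a sketch under stated assumptions: `Node`, `coordinate`, `read` and `run_scenario` are hypothetical helpers, time is counted in whole seconds, node A is treated as the reference clock, and a stored value of `None` represents the DELETE.

```python
class Node:
    """Hypothetical coordinator whose clock may be offset from the true time."""

    def __init__(self, name, clock_offset):
        self.name = name
        self.clock_offset = clock_offset

    def now(self, true_time):
        return true_time + self.clock_offset


def coordinate(node, true_time, value, storage):
    """Stamp the mutation with the coordinator's local clock and store it."""
    timestamp = node.now(true_time)
    storage.append((timestamp, value))
    return timestamp


def read(storage):
    """Last Write Wins: return the value carrying the highest timestamp."""
    if not storage:
        return None
    return max(storage, key=lambda cell: cell[0])[1]


def run_scenario(offset_a, offset_b):
    node_a, node_b = Node("A", offset_a), Node("B", offset_b)
    storage = []
    coordinate(node_a, 10, None, storage)    # step 1: DELETE via A at true time 10
    coordinate(node_b, 11, "data", storage)  # step 2: UPDATE via B one second later
    return read(storage)                     # step 3: SELECT


print(run_scenario(0, -5))  # None: DELETE stamped t10 beats UPDATE stamped t6
print(run_scenario(0, 0))   # data: with synchronized clocks the UPDATE is stamped t11
```

With B's clock five seconds behind, the later UPDATE loses to the earlier DELETE; with both offsets at zero the UPDATE wins as expected.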

Note that newer versions of the DataStax drivers default to client timestamps, meaning that your client application generates and assigns timestamps to requests instead of relying on the Cassandra nodes to assign them. The DataStax Java driver defaults to client timestamps as of version 3.0 (read more about this in 'Client-side generation'). This works very well if all requests come from the same client; however, if you have multiple applications writing to Cassandra, you now have to worry about keeping your client clocks in sync.
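To illustrate why a single client is the easy case, here is a minimal sketch of a client-side generator that never hands out the same or a smaller timestamp twice, even if the local clock stalls or steps backwards. The class name `MonotonicTimestampGenerator` and its `clock` parameter are assumptions made for this example; this is not the driver's implementation.

```python
import threading
import time


class MonotonicTimestampGenerator:
    """Hypothetical generator of strictly increasing microsecond timestamps."""

    def __init__(self, clock=lambda: time.time_ns() // 1000):
        self._clock = clock          # returns the current time in microseconds
        self._last = 0
        self._lock = threading.Lock()

    def next_timestamp(self):
        with self._lock:
            now = self._clock()
            # If the clock has not advanced (or went backwards), bump by one.
            self._last = max(now, self._last + 1)
            return self._last


generator = MonotonicTimestampGenerator()
first = generator.next_timestamp()
second = generator.next_timestamp()
print(second > first)  # True
```

A generator like this only orders the writes of the one process that owns it. Two separate clients each have their own generator and their own clock, so skew between them brings back the same problem as skew between coordinator nodes.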
