How does Cassandra find the node that contains the data?


Question

I've read quite a few articles and a lot of questions and answers on SO about Cassandra, but I still can't figure out how Cassandra decides which node(s) to go to when it's reading the data.

First, some assumptions about an imaginary cluster:

  1. Replication strategy = simple
  2. Using the random partitioner
  3. Cluster of 10 nodes
  4. Replication factor of 5

Here's my understanding of how writes work, based on various Datastax articles and other blog posts I've read:

  • Client sends the data to a random node
  • The "random" node is decided based on the MD5 hash of the primary key.
  • Data is written to the commit_log and memtable and then propagated 4 times (with RF = 5).

The next 4 nodes in the ring are then selected and the data is persisted in them.

So far, so good.

Now the question is, when the client sends a read request (say with CL = 3) to the cluster, how does Cassandra know which nodes (5 out of 10 as the worst-case scenario) it needs to contact to get this data? Surely it's not going to all 10 nodes, as that would be inefficient.

Am I correct in assuming that Cassandra will again compute an MD5 hash of the primary key (of the request), choose the node according to that, and then walk the ring?

Also, how does the network topology case work? If I have multiple data centers, how does Cassandra know which nodes in each DC/rack contain the data? From what I understand, only the first node is obvious (since the hash of the primary key has resulted in that node explicitly).

Sorry if the question is not very clear; please add a comment if you need more details about my question.

Thanks a lot,

Recommended answer

Client sends the data to a random node

It might seem that way, but there is actually a non-random way that your driver picks a node to talk to. This node is called a "coordinator node" and is typically chosen based on having the least (closest) "network distance." Client requests can really be sent to any node, and at first they will be sent to the nodes which your driver knows about. But once the driver connects and understands the topology of your cluster, it may change to a "closer" coordinator.

The nodes in your cluster exchange topology information with each other using the Gossip Protocol. The gossiper runs every second, and ensures that all nodes are kept current with data from whichever Snitch you have configured. The snitch keeps track of which data centers and racks each node belongs to.

In this way, the coordinator node also has data about which nodes are responsible for each token range. You can see this information by running nodetool ring from the command line. If you are using vnodes, though, that will be trickier to ascertain, as data on all 256 (default) virtual nodes will quickly flash by on the screen.
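To make the token-range idea concrete, here is a minimal Python sketch of how a coordinator could map a key to its replicas under the question's assumptions (simple strategy, MD5-based tokens, 10 nodes, RF = 5). It is a toy model, not Cassandra's implementation: the node names, the evenly spaced single token per node, and the helper functions md5_token, build_ring and replicas_for are all hypothetical.

```python
import bisect
import hashlib


def md5_token(key):
    """Toy token: the MD5 digest of the key read as an integer (0 .. 2**128 - 1)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def build_ring(node_count):
    """Give each hypothetical node one evenly spaced token, sorted by token."""
    step = 2**128 // node_count
    return [(i * step, f"node{i + 1}") for i in range(node_count)]


def replicas_for(key, ring, rf):
    """Return the owner of the key's token plus the next rf - 1 nodes clockwise."""
    tokens = [token for token, _ in ring]
    # The owner is the first node whose token is >= the key's token;
    # past the last token, the ring wraps around to the first node.
    start = bisect.bisect_left(tokens, md5_token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


ring = build_ring(10)
print(replicas_for("Mal", ring, rf=5))
```

Because every node holds the same ring, any coordinator can compute the same replica list locally from the key alone, so it never has to ask all 10 nodes where the data lives.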

So let's say that I have a table that I'm using to keep track of ship crew members by their first name, and let's assume that I want to look up Malcolm Reynolds. Running this query:

SELECT token(firstname), firstname, id, lastname
FROM usersbyfirstname WHERE firstname='Mal';

...returns this row:

 token(firstname)     | firstname | id | lastname
----------------------+-----------+----+-----------
  4016264465811926804 |       Mal |  2 |  Reynolds

By running nodetool ring I can see which node is responsible for this token:

192.168.1.22  rack1       Up     Normal  348.31 KB   3976595151390728557                         
192.168.1.22  rack1       Up     Normal  348.31 KB   4142666302960897745                         
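As a rough check of that reading, the following Python sketch compares the row's token against the two ring tokens shown above. It assumes the usual convention that a ring entry owns the range from the previous token (exclusive) up to its own token (inclusive), and it only models the two lines of the excerpt; a real ring has many more entries.

```python
import bisect

# The two adjacent ring tokens from the nodetool ring excerpt, both on 192.168.1.22.
ring = [
    (3976595151390728557, "192.168.1.22"),
    (4142666302960897745, "192.168.1.22"),
]
row_token = 4016264465811926804  # token(firstname) for 'Mal'

tokens = [token for token, _ in ring]
# A token belongs to the first ring entry whose token is >= it.
idx = bisect.bisect_left(tokens, row_token)
owner_token, owner = ring[idx]
print(owner_token, owner)
```

The row's token falls between the two listed tokens, so it lands in the range that ends at 4142666302960897745, which belongs to 192.168.1.22.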

Or even easier, I can use nodetool getendpoints to see this data:

$ nodetool getendpoints stackoverflow usersbyfirstname Mal
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar 
192.168.1.22

For more information, check out some of the items linked above, or try running nodetool gossipinfo.
