Cassandra sharding and replication


Problem description

I am new to Cassandra and was going through this article explaining sharding and replication, and I am stuck at one point -

I have a cluster with 6 Cassandra nodes configured on my local machine. I created a new keyspace "TestKeySpace" with a replication factor of 6, and a table "employee" in that keyspace whose primary key is an auto-increment number named RID. I am not able to understand how this data will be partitioned and replicated. What I want to know is: since my replication factor is 6 and the data will be distributed across multiple nodes, will each node have exactly the same data as the other nodes?

What if my cluster has the following configuration -

    Number of nodes - 6 (n1, n2, n3, n4, n5 and n6).
    replication_factor - 3.

How can I determine, for any one node (let's say n1), on which other two nodes its data is replicated and which other nodes are behaving as different shards?

Thanks in advance.

Regards, Vibhav

PS - If anybody downvotes this question, kindly mention in the comments what went wrong.

Solution

I will explain this with a simple example. A keyspace in Cassandra is roughly equivalent to a database schema name in an RDBMS.

First create a keyspace -

CREATE KEYSPACE MYKEYSPACE WITH REPLICATION = { 
 'class' : 'SimpleStrategy', 
 'replication_factor' : 3 
};

Let's create a simple table -

CREATE TABLE USER_BY_USERID(
 userid int,
 name text,
 email text,
 PRIMARY KEY(userid, name)
) WITH CLUSTERING ORDER BY (name DESC);

In this example, userid is the partition key and name is the clustering key. The partition key is also called the row key; it determines on which node a row is stored.
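As a rough illustration of the two roles, here is a minimal Python sketch (not Cassandra code) that groups rows by the partition key and sorts each group by the clustering key in descending order, mirroring the CLUSTERING ORDER BY clause above. The function name group_by_partition and the third sample row are made up for illustration.

```python
from collections import defaultdict

# (userid, name, email) tuples; the last row is a hypothetical extra entry
# so that partition 1 holds more than one row.
ROWS = [
    (1, "Darth Veder", "darthveder@star-wars.com"),
    (448454, "Obi wan kenobi", "obiwankenobi@star-wars.com"),
    (1, "Anakin", "anakin@example.com"),
]


def group_by_partition(rows):
    """Group rows by partition key (userid), sorted by clustering key (name) descending."""
    partitions = defaultdict(list)
    for userid, name, email in rows:
        partitions[userid].append((name, email))
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda row: row[0], reverse=True)
    return dict(partitions)


if __name__ == "__main__":
    for userid, partition_rows in group_by_partition(ROWS).items():
        print(userid, [name for name, _ in partition_rows])
```

All rows that share a userid end up in the same group, which is why they would also live on the same set of nodes.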

Your first question -

I am not able to understand how this data will be partitioned?

Data is partitioned based on the partition key. By default, C* uses Murmur3Partitioner; you can change the partitioner in the cassandra.yaml configuration file. How partitions are distributed also depends on your configuration. You can assign each node a token, which determines the range of tokens it owns; for example, take a look at the cassandra.yaml configuration files below. I have assumed the 6 nodes from your question.

cassandra.yaml for Node 0:

cluster_name: 'MyCluster'
initial_token: 0
seed_provider:
    - seeds:  "198.211.xxx.0"
listen_address: 198.211.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

cassandra.yaml for Node 1:

cluster_name: 'MyCluster'
initial_token: 3074457345618258602
seed_provider:
    - seeds:  "198.211.xxx.0"
listen_address: 192.241.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

cassandra.yaml for Node 2:

cluster_name: 'MyCluster'
initial_token: 6148914691236517205
seed_provider:
    - seeds:  "198.211.xxx.0"
listen_address: 37.139.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch

.......Node3 ...... Node4 ....

cassandra.yaml for Node 5:

cluster_name: 'MyCluster'
initial_token: {some large number}
seed_provider:
    - seeds:  "198.211.xxx.0"
listen_address: 37.139.xxx.0
rpc_address: 0.0.0.0
endpoint_snitch: RackInferringSnitch
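The initial_token values are normally chosen so that the nodes split the token range evenly. Below is a minimal Python sketch of that calculation, assuming Murmur3Partitioner's token range of -2^63 to 2^63 - 1 and one token per node; the function name evenly_spaced_tokens is made up for illustration. It starts at -2^63, so its output differs from the sample values above, which start at 0.

```python
def evenly_spaced_tokens(node_count):
    """Return one initial_token per node, evenly spaced over the Murmur3 range.

    Assumes the Murmur3Partitioner range [-2**63, 2**63 - 1] and a single
    token per node (no virtual nodes).
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    step = 2**64 // node_count
    return [i * step - 2**63 for i in range(node_count)]


if __name__ == "__main__":
    for node, token in enumerate(evenly_spaced_tokens(6)):
        print(f"Node {node}: initial_token: {token}")
```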

Let's take this insert statement -

INSERT INTO USER_BY_USERID (userid, name, email) VALUES (
 1,
 'Darth Veder',
 'darthveder@star-wars.com'
);

The partitioner calculates the hash (token) of the partition key (userid 1 in the example above), and that token decides which node stores the row. Let's say the calculated token is 12345. Each node owns the tokens from the previous node's initial_token (exclusive) up to its own initial_token (inclusive), so this row is stored on Node 1, whose range in the configuration above is (0, 3074457345618258602].

For the complete cassandra.yaml configuration, see configCassandra_yaml_r.

You can go through deployCalcTokens to learn how to generate tokens.

Second question -

How does the data get replicated?

Data is replicated according to the replication strategy and replication factor, both of which you specify when creating the keyspace. In the example above I used SimpleStrategy, which is suitable for a small, single-datacenter cluster. For a geographically distributed application you can use NetworkTopologyStrategy. replication_factor specifies how many copies of each row are kept; in this example, three. With SimpleStrategy, Cassandra places the first replica on the node that owns the token and the remaining replicas on the next nodes clockwise around the ring.

In the example above, the row is stored on Node 1 and is also copied to Node 2 and Node 3. Let's take another example -

INSERT INTO USER_BY_USERID (userid, name, email) VALUES (
 448454,
 'Obi wan kenobi',
 'obiwankenobi@star-wars.com'
);

For userid 448454, say the calculated token is 3074457345618258609. That is greater than Node 1's initial_token and no greater than Node 2's, so the row is stored on Node 2 (compare the initial_token values in the configuration above) and is also copied clockwise to Node 3 and Node 4. Remember that we specified a replication factor of 3, so there are exactly three copies: Node 2, Node 3 and Node 4.
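The placement rule can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the ring uses small made-up tokens (0, 100, ..., 500) for Node 0 to Node 5, and the names RING_TOKENS, owner_index and replica_nodes are hypothetical. It assumes SimpleStrategy and one token per node.

```python
from bisect import bisect_left

# Toy ring: position in the list is the node number, value is its token.
# These small tokens are made up; real Murmur3 tokens are 64-bit integers.
RING_TOKENS = [0, 100, 200, 300, 400, 500]


def owner_index(token, ring_tokens=RING_TOKENS):
    """Return the node owning `token`: the first node whose token is >= it.

    Tokens above the highest node token wrap around to the first node.
    """
    index = bisect_left(ring_tokens, token)
    return index % len(ring_tokens)


def replica_nodes(token, replication_factor, ring_tokens=RING_TOKENS):
    """Return the owner plus the next nodes clockwise, SimpleStrategy-style."""
    node_count = len(ring_tokens)
    # This toy model cannot place more copies than there are nodes.
    copies = min(replication_factor, node_count)
    first = owner_index(token, ring_tokens)
    return [(first + step) % node_count for step in range(copies)]


if __name__ == "__main__":
    print(replica_nodes(150, 3))  # token between Node 1 and Node 2
    print(replica_nodes(550, 3))  # token past the last node wraps around
    print(replica_nodes(150, 6))  # replication factor equal to node count
```

With a replication factor of 3, a token of 150 lands on Node 2 and is copied to Node 3 and Node 4, the same pattern as the second example. With a replication factor of 6 on six nodes, every node appears in the result, so each node holds a copy of every row.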

Hope this helps.
