Cassandra load balancing with TokenAwarePolicy and shuffleReplicas

Question

We have a 6-node cluster deployed to one AWS region with 3 Availability Zones. We are using Ec2Snitch, which should place one replica in each availability zone. We use the DataStax Java driver. The servers doing writes and reads are distributed across the same availability zones as the nodes (one server per AZ). What we want to achieve is the best possible read performance; writes are less important, in the sense that the data must be written but not necessarily fast. We use a replication factor of 3, but read and write with consistency level ONE.

We are investigating shuffleReplicas in TokenAwarePolicy. The DataStax Java driver documentation says it can increase read performance but decrease write distribution.
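
For reference, here is a minimal sketch of how the flag is toggled, assuming a 3.x Java driver version where TokenAwarePolicy takes shuffleReplicas as a constructor argument (the contact point and child policy below are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class ShuffleReplicasExample {
    public static void main(String[] args) {
        // shuffleReplicas = false: replicas are tried in ring order,
        // so the primary replica is always attempted first.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // placeholder contact point
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder().build(), false))
                .build();
        cluster.close();
    }
}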

The first question is about the shuffleReplicas implementation. I followed the implementation of the newQueryPlan method, and what I figured out is that a LinkedHashSet is used for the replicas, meaning that the primary replica will always be preferred over a non-primary replica:

// Preserve order - primary replica will be first
Set<Host> replicas = new LinkedHashSet<Host>();

Just to confirm: does that mean the driver will always prefer to connect to the node where the primary replica is, to have it as coordinator, if we set shuffleReplicas to false, and can this create hot spots?

The second question is about the idea of separating connections to the cluster: for writes, use shuffleReplicas set to true, which will distribute tokens evenly across the cluster, and for reads, use TokenAwarePolicy with shuffleReplicas set to false to get the best possible reads. Is this idea viable, and do you see any problems with it?

We would like reads to always come from the same availability zone, to get the maximum possible speed when reading data. Is this a better approach than leaving shuffleReplicas set to true and letting the cluster choose the coordinator evenly? Another idea could be to use WhiteListPolicy, which would select only nodes from the same AZ for the servers placed in that AZ; this would result in local reads, but could create hot spots.
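
To illustrate the WhiteListPolicy idea, here is a hedged sketch, again assuming a 3.x Java driver; the node addresses are hypothetical stand-ins for the nodes in the client's own AZ:

import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.WhiteListPolicy;

public class SameAzWhiteListExample {
    public static void main(String[] args) {
        // Hypothetical addresses of the nodes in this client's AZ.
        List<InetSocketAddress> sameAzNodes = Arrays.asList(
                new InetSocketAddress("10.0.1.10", 9042),
                new InetSocketAddress("10.0.1.11", 9042));

        // WhiteListPolicy restricts coordinators to the listed hosts,
        // keeping reads AZ-local but concentrating load on those nodes.
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.1.10")
                .withLoadBalancingPolicy(
                        new WhiteListPolicy(new RoundRobinPolicy(), sameAzNodes))
                .build();
        cluster.close();
    }
}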

Answer


Just to confirm: does that mean the driver will always prefer to connect to the node where the primary replica is, to have it as coordinator, if we set shuffleReplicas to false, and can this create hot spots?

Yes. Note however that this creates hot spots only if all your partition keys map to the same replica; if your partition keys are evenly distributed across the token ring, it should be fine.


The second question is about the idea of separating connections to the cluster: for writes, use shuffleReplicas set to true, which will distribute tokens evenly across the cluster, and for reads, use TokenAwarePolicy with shuffleReplicas set to false to get the best possible reads. Is this idea viable, and do you see any problems with it?

The main problem I see is that the driver is not capable of telling if a request is a "read" or a "write", so you will have to either write your own load balancing policy, or use two separate Cluster instances, one for reads, one for writes.
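
To make the two-Cluster approach concrete, here is a minimal sketch under the same 3.x driver assumption, with a hypothetical keyspace name; note that each Cluster keeps its own connection pools, so this doubles the connections to the nodes:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class ReadWriteClustersExample {
    public static void main(String[] args) {
        // Writes: shuffleReplicas = true spreads coordinator duty
        // evenly over all replicas of each partition.
        Cluster writeCluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder().build(), true))
                .build();

        // Reads: shuffleReplicas = false always tries the primary
        // replica of the partition first.
        Cluster readCluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(new TokenAwarePolicy(
                        DCAwareRoundRobinPolicy.builder().build(), false))
                .build();

        Session writeSession = writeCluster.connect("my_keyspace"); // hypothetical keyspace
        Session readSession = readCluster.connect("my_keyspace");
        // Route writes through writeSession and reads through readSession
        // in application code; the driver cannot do this routing itself.
    }
}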

Otherwise, setting shuffleReplicas to false doesn't necessarily mean you will get the "best possible reads". The main effect to consider when using shuffleReplicas is eventual consistency: when shuffleReplicas is true, it is possible to read stale values, e.g. if you write to replica 1 with consistency ONE, then read from replica 2 with consistency ONE. I usually recommend setting shuffleReplicas to true for both reads and writes to spread the load evenly on your cluster, and adjusting your consistency levels to get the best balance between throughput and the risk of reading stale values.
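
As a sketch of the consistency-level tuning mentioned above (hypothetical keyspace and table, same 3.x driver assumption): with replication factor 3, QUORUM reads combined with QUORUM writes satisfy R + W > RF, so a read always overlaps a replica that acknowledged the write, closing the stale-read window that exists with ONE/ONE:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyLevelExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // QUORUM read + QUORUM write: the read quorum intersects the
        // write quorum, so the latest acknowledged write is seen.
        SimpleStatement read = new SimpleStatement(
                "SELECT * FROM my_keyspace.my_table WHERE id = 42");
        read.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(read);

        cluster.close();
    }
}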
