Why is it so bad to have large partitions in Cassandra?


Problem Description

I have seen this warning everywhere but cannot find any detailed explanation on this topic.

Recommended Answer

For starters:

The maximum number of cells (rows x columns) in a single partition is 2 billion.

If you allow a partition to grow unbounded, you will eventually hit this limitation.
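The usual way to keep partitions bounded is to fold a time bucket (for example, a day) into the partition key. Below is a minimal sketch using the DataStax Python driver; the keyspace, table, and column names are hypothetical, and it assumes a Cassandra node is reachable on localhost:

```python
# A minimal sketch (assumes a local Cassandra node and the cassandra-driver
# package; keyspace/table/column names are hypothetical).
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Unbounded design: one partition per sensor keeps growing as readings arrive.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings_unbounded (
        sensor_id    text,
        reading_time timestamp,
        value        double,
        PRIMARY KEY (sensor_id, reading_time)
    )
""")

# Bucketed design: the day is part of the composite partition key, so each
# partition holds at most one day's worth of readings for a given sensor.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings_by_day (
        sensor_id    text,
        day          date,
        reading_time timestamp,
        value        double,
        PRIMARY KEY ((sensor_id, day), reading_time)
    )
""")

now = datetime.now(timezone.utc)
session.execute(
    "INSERT INTO demo.readings_by_day (sensor_id, day, reading_time, value) "
    "VALUES (%s, %s, %s, %s)",
    ("sensor-42", now.date(), now, 21.5),
)
cluster.shutdown()
```

The trade-off of bucketing is that queries must now supply both sensor_id and day (or fan out over several buckets), but in exchange no single partition can grow without bound.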

Outside that theoretical limit, there are practical limitations tied to the impact large partitions have on the JVM and on read times. These practical limits keep increasing from version to version, and they are not fixed: they vary with data model, query patterns, heap size, and configuration, which makes it hard to give a straight answer on what is too large.

As of 2.1 and early 3.0 releases, the primary cost on reads and compactions comes from deserializing the index, which marks a row every column_index_size_in_kb. You can increase key_cache_size_in_mb so that reads avoid unnecessary deserialization, but that reduces heap space and fills the old generation. You can increase the column index size, but that increases the worst-case IO cost on reads. There are also many different CMS and G1 settings for tuning the impact of the huge spike in object allocations that occurs when reading these big partitions. There are active efforts to improve this, so in the future it might no longer be the bottleneck.
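For reference, the two knobs named above live in cassandra.yaml (the CMS/G1 settings are JVM flags configured separately). The values below are only illustrative, not recommendations:

```yaml
# cassandra.yaml (excerpt) -- illustrative values, not recommendations

# Granularity of the index that marks positions inside a partition. A larger
# value means less index data to deserialize when reading a big partition,
# at the cost of a worse worst-case IO per indexed read.
column_index_size_in_kb: 64

# Size of the key cache. A bigger cache can avoid repeatedly deserializing
# partition indexes, but it takes heap away from everything else.
key_cache_size_in_mb: 100
```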

Repairs also only go down to the partition level (in the best case). So if, say, you are constantly appending to a partition, and the hashes of that partition on two nodes are compared at slightly different times (which a distributed system essentially guarantees), the entire partition must be streamed over to ensure consistency. Incremental repairs can reduce the impact of this, but you are still streaming massive amounts of data and churning the disk significantly, and the streamed data then needs to be compacted together unnecessarily.

You could keep adding to this list of corner cases and problematic scenarios. Much of the time large partitions can still be read, but the tuning and corner cases involved are not really worth it; it is better to design the data model to fit how Cassandra expects to be used. I would recommend targeting 100 MB, though you can go well beyond that comfortably. Once you get into gigabytes you will need to start considering tuning for it (depending on data model, use case, etc.).
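One practical way to check whether existing tables are anywhere near those sizes is nodetool's per-table statistics; a quick sketch (the keyspace and table names are hypothetical, and on older releases the second command is called cfhistograms):

```sh
# Per-table summary, including "Compacted partition maximum bytes"
nodetool tablestats demo.readings_by_day

# Percentile histogram of partition sizes for the same table
nodetool tablehistograms demo readings_by_day
```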
