Cassandra Wide Vs Skinny Rows for large columns

Problem Description

I need to insert 60GB of data into Cassandra per day.

This breaks down into
100 sets of keys
150,000 keys per set
4KB of data per key
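
(For scale: 100 sets x 150,000 keys x 4 KB per key is roughly 60 GB of raw values per day, before row and column overhead.)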

In terms of write performance, am I better off using
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row
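
For what it's worth, all four layouts above can be produced by the same writer if the row key is derived from the data key, e.g. by hashing each key into a fixed number of buckets per set. A minimal sketch, assuming illustrative names (set_id, data_key and BUCKETS are not from the original setup):

import zlib

BUCKETS = 100   # 1, 10, 100 or 1000 rows per set, per the options above

def row_key(set_id, data_key):
    # Map each of the 150,000 keys in a set onto one of BUCKETS rows.
    # The mapping is deterministic, so reads know which row to hit.
    bucket = zlib.crc32(data_key.encode('utf-8')) % BUCKETS
    return '%s:%03d' % (set_id, bucket)

# Example: every data key for 'set_017' lands in one of
# 'set_017:000' ... 'set_017:099'.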

Another variable to consider, my data expires after 24 hours so I am using TTL=86400 to automate expiration
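
With pycassa (the client used in the test method below), the TTL is attached per insert; a minimal sketch, assuming a keyspace name and host (both illustrative) and the stuff column family shown in the configuration below:

import pycassa

pool = pycassa.ConnectionPool('my_keyspace', ['127.0.0.1:9160'])   # keyspace/host assumed
stuff = pycassa.ColumnFamily(pool, 'stuff')

packed_value = b'\x00' * 4000   # stands in for one packed value of 1000 4-byte floats

# Every column written with ttl=86400 expires 24 hours later,
# so no explicit cleanup pass is needed.
stuff.insert('some_stuff_id', {'some_stuff_column': packed_value}, ttl=86400)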

More specific details about my configuration:

CREATE TABLE stuff (
  stuff_id text,
  stuff_column text,
  value blob,
  PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.100000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=39600 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

Access pattern details:
The 4KB value is a set of 1000 4-byte floats packed into a string.

A typical request is going to need a random selection of 20 - 60 of those floats.
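
A small sketch of that packing and selection step using only the standard struct and random modules (the original packing code is not shown in the question, so the byte order here is an assumption):

import random
import struct

levels = [random.random() for _ in range(1000)]     # 1000 floats per key

# 1000 x 4-byte floats -> a 4000-byte string (the ~4KB value).
packed = struct.pack('<1000f', *levels)

# A typical request needs 20 - 60 of those floats at random positions.
wanted = sorted(random.sample(range(1000), random.randint(20, 60)))
values = [struct.unpack_from('<f', packed, 4 * i)[0] for i in wanted]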

Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time if it were all written to one row with 150,000 columns.

As time passes, some of the data is updated: within a logical row in the set of columns, a random set of levels within the packed string gets new values. Instead of updating in place, the new levels are written to a new logical row, combined with other new data, to avoid rewriting all of the data that is still valid. This leads to fragmentation, since multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read the same column across 1 - 5 different rows.
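
In pycassa terms, that fragmented read looks roughly like a multiget of the same column across the candidate rows; a sketch with hypothetical row keys (the real naming scheme for the rewrite rows is not given in the question):

import pycassa

pool = pycassa.ConnectionPool('my_keyspace', ['127.0.0.1:9160'])   # keyspace/host assumed
stuff = pycassa.ColumnFamily(pool, 'stuff')

# The 1 - 5 logical rows that may each hold a fragment of the set.
candidate_rows = ['set_017:042', 'set_017:042:rewrite_1', 'set_017:042:rewrite_2']

fragments = stuff.multiget(candidate_rows, columns=['some_stuff_column'])
for key, cols in fragments.items():
    packed = cols['some_stuff_column']
    # Newer fragments override the stale levels from older rows.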

Test Method

I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as Bytes_written / (time * 10^6), with time measured in seconds at millisecond precision. Pycassa was used as the Cassandra interface, with its batch insert operator. Each insert writes multiple columns to a single row; insert sizes are limited to 12 MB, and the queue is flushed at 12 MB or less. Sizes do not account for row and column overhead, just data. The data source and data sink are on the same network, on different systems.
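
A rough reconstruction of that measurement loop with pycassa's batch mutator (queue size, key/column names and the payload are assumptions based on the description; pycassa's queue_size counts mutations rather than bytes, so ~3000 x 4 KB approximates the 12 MB flush limit):

import time
import pycassa

pool = pycassa.ConnectionPool('my_keyspace', ['127.0.0.1:9160'])   # keyspace/host assumed
stuff = pycassa.ColumnFamily(pool, 'stuff')

payload = b'\x00' * 4000            # stands in for one packed value of 1000 4-byte floats
rows, keys_per_row = 100, 1500      # one of the four tested layouts

start = time.time()
batch = stuff.batch(queue_size=3000)          # ~3000 x 4KB is roughly 12 MB per flush
for r in range(rows):
    for k in range(keys_per_row):
        batch.insert('row_%05d' % r, {'col_%06d' % k: payload}, ttl=86400)
batch.send()                                  # flush whatever is still queued
elapsed = time.time() - start

bytes_written = rows * keys_per_row * len(payload)
print('%.1f MBps' % (bytes_written / (elapsed * 1e6)))   # Bytes_written / (time * 10^6)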

Write Results

Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.

1 row per set, 150,000 keys per row: 14 MBps
10 rows per set, 15,000 keys per row: 15 MBps
100 rows per set, 1,500 keys per row: 18 MBps
1000 rows per set, 150 keys per row: 11 MBps

Recommended Answer

The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:

  • Wide row (1 row per set): This could be the best solution, as it prevents a request from hitting several nodes at once, and with secondary indexing or composite column names you can quickly filter data to your needs (a pycassa sketch of such column filtering follows this list). This is best if you need to access one set of data per request. However, doing too many multigets on wide rows can increase memory pressure on the nodes and degrade performance.
  • Skinny row (1000 rows per set): On the other hand, a wide row can give rise to read hotspots in the cluster. This is especially the case if you need to make a high volume of requests for a subset of data that lives entirely in one wide row. In that case, skinny rows will distribute your requests more uniformly throughout the cluster and avoid hotspots. Also, in my experience, "skinnier" rows tend to behave better with multigets.
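
As a concrete illustration of the column filtering mentioned in the wide-row bullet, pycassa can fetch an explicit list of columns or a contiguous slice of column names from a single wide row, so a request never has to pull all 150,000 columns; a sketch with hypothetical names:

import pycassa

pool = pycassa.ConnectionPool('my_keyspace', ['127.0.0.1:9160'])   # keyspace/host assumed
stuff = pycassa.ColumnFamily(pool, 'stuff')

# Explicit columns from one wide row...
subset = stuff.get('set_017', columns=['key_000042', 'key_001337'])

# ...or a contiguous slice, relying on the sorted order of stuff_column.
window = stuff.get('set_017', column_start='key_000100',
                   column_finish='key_000160', column_count=60)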

I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
