Cassandra Wide Vs Skinny Rows for large columns


Problem Description

I need to insert 60GB of data into cassandra per day.

This breaks down into:
100 sets of keys
150,000 keys per set
4KB of data per key
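
As a quick sanity check on that breakdown (a sketch only; it assumes the 4KB value is the 4,000-byte packed string described below, and counts GB as 10^9 bytes):

# Daily volume implied by the breakdown above.
sets_per_day = 100
keys_per_set = 150000
bytes_per_key = 4000                       # 1000 floats * 4 bytes each

total_bytes = sets_per_day * keys_per_set * bytes_per_key
print(total_bytes / 10**9)                 # -> 60 GB per day, before row/column overhead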

In terms of write performance, am I better off using:
1 row per set with 150,000 keys per row
10 rows per set with 15,000 keys per row
100 rows per set with 1,500 keys per row
1000 rows per set with 150 keys per row

Another variable to consider: my data expires after 24 hours, so I am using TTL=86400 to automate expiration.
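
For context, here is a minimal Pycassa sketch of how one of the candidate layouts (100 rows per set, roughly 1,500 keys per row) could be written with that TTL. The keyspace, server list, bucketing scheme and helper names are illustrative assumptions, not part of the original setup:

import pycassa
import zlib

# Hypothetical connection details; adjust keyspace/servers for your cluster.
pool = pycassa.ConnectionPool('my_keyspace', server_list=['localhost:9160'])
stuff = pycassa.ColumnFamily(pool, 'stuff')

ONE_DAY = 86400        # TTL in seconds, matching the 24-hour expiration
ROWS_PER_SET = 100     # illustrative choice: ~1,500 keys per row

def bucket(key):
    # Stable hash so a reader can later compute the same bucket for a key.
    return (zlib.crc32(key.encode('utf-8')) & 0xffffffff) % ROWS_PER_SET

def write_set(set_id, packed_values):
    """packed_values: dict mapping key name -> 4KB packed string."""
    b = stuff.batch(queue_size=1000)
    for key, blob in packed_values.items():
        row_key = '%s:%03d' % (set_id, bucket(key))    # e.g. '20140101:042'
        b.insert(row_key, {key: blob}, ttl=ONE_DAY)
    b.send()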

More specific details about my configuration:

CREATE TABLE stuff (
  stuff_id text,
  stuff_column text,
  value blob,
  PRIMARY KEY (stuff_id, stuff_column)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.100000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=39600 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'tombstone_compaction_interval': '43200', 'class': 'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

Access pattern details:

The 4KB value is a set of 1000 4-byte floats packed into a string.

A typical request is going to need a random selection of 20 - 60 of those floats.
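
A small sketch of that packing and selection, assuming standard single-precision, little-endian floats (the question does not specify the exact format):

import random
import struct

# Pack 1000 4-byte floats into one ~4KB string (the stored value).
floats = [random.random() for _ in range(1000)]
blob = struct.pack('<1000f', *floats)
assert len(blob) == 4000

# A typical request: 20 - 60 floats at random positions, read back
# directly at their 4-byte offsets without unpacking the whole blob.
wanted = sorted(random.sample(range(1000), random.randint(20, 60)))
selected = [struct.unpack_from('<f', blob, 4 * i)[0] for i in wanted]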

Initially, those floats are all stored in the same logical row and column. A logical row here represents a set of data at a given time, as if it were all written to one row with 150,000 columns.

As time passes, some of the data is updated: within a logical row, a random set of levels within a column's packed string will be updated. Instead of updating in place, the new levels are written to a new logical row, combined with other new data, to avoid rewriting all of the data which is still valid. This leads to fragmentation, as multiple rows now need to be accessed to retrieve that set of 20 - 60 values. A request will now typically read the same column across 1 - 5 different rows.
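
With that fragmentation, a read through Pycassa ends up as a multiget of the same column name across a handful of row keys, roughly like this (keyspace, row keys and column name are placeholders):

import pycassa

pool = pycassa.ConnectionPool('my_keyspace', server_list=['localhost:9160'])
stuff = pycassa.ColumnFamily(pool, 'stuff')

# The 1 - 5 logical rows that now hold pieces of this key's data.
row_keys = ['set_t2', 'set_t1', 'set_t0']

results = stuff.multiget(row_keys, columns=['key_00042'])
for row_key, cols in results.items():
    blob = cols['key_00042']
    # ... unpack the needed floats from each blob as in the sketch above ...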

Test Method

I wrote 5 samples of random data for each configuration and averaged the results. Rates were calculated as (Bytes_written / (time * 10^6)). Time was measured in seconds with millisecond precision. Pycassa was used as the Cassandra interface, with the Pycassa batch insert operator. Each insert adds multiple columns to a single row, and insert sizes are limited to 12 MB; the queue is flushed at 12 MB or less. Sizes do not account for row and column overhead, just the data. The data source and data sink are on the same network, on different systems.
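
A rough sketch of that measurement loop, with manual byte counting and a 12 MB flush threshold (the helper and its arguments are made up for illustration; cf is a pycassa ColumnFamily as in the earlier sketches):

import time

def timed_write(rows, cf, flush_bytes=12 * 10**6):
    """rows: iterable of (row_key, {column: packed_value}) pairs."""
    b = cf.batch(queue_size=10**9)          # large queue so flushing is driven by bytes below
    bytes_written = 0
    pending = 0
    start = time.time()
    for row_key, columns in rows:
        size = sum(len(v) for v in columns.values())
        if pending + size > flush_bytes:    # keep each flush at 12 MB or less
            b.send()
            pending = 0
        b.insert(row_key, columns, ttl=86400)
        bytes_written += size
        pending += size
    b.send()
    elapsed = time.time() - start           # wall-clock seconds
    return bytes_written / (elapsed * 10**6)    # MBps, data only (no overhead)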

Write Results

Keep in mind there are a number of other variables in play due to the complexity of the Cassandra configuration.

1 row per set, 150,000 keys per row: 14 MBps
10 rows per set, 15,000 keys per row: 15 MBps
100 rows per set, 1,500 keys per row: 18 MBps
1000 rows per set, 150 keys per row: 11 MBps

Answer

The answer depends on what your data retrieval pattern is, and how your data is logically grouped. Broadly, here is what I think:


  • Wide rows (1 row per set): This is probably the best solution in terms of preventing hits to multiple nodes at once, and with secondary indexes or composite column names you can quickly filter the data down to what you need. It is best if you need access to a whole set of data per request. However, doing too many multigets across wide rows can increase memory pressure on the nodes and degrade performance.

  • Skinny rows (1000 rows per set): On the other hand, wide rows can give rise to read hotspots in the cluster. This is especially true if you need to make a high volume of requests for a subset of data that lives entirely in one wide row. In that case, skinny rows will distribute your requests more evenly across the cluster and avoid hotspots (a read-side sketch contrasting the two layouts follows this list).

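To make the read-side contrast concrete, here is a hedged Pycassa sketch against the table above; the set name, column names and bucketing scheme are placeholders, not something from the original post:

import pycassa
import zlib

pool = pycassa.ConnectionPool('my_keyspace', server_list=['localhost:9160'])
stuff = pycassa.ColumnFamily(pool, 'stuff')

wanted = ['key_00042', 'key_07311', 'key_14999']    # columns one request needs

# Wide row: the whole set lives in one row, so a single get with a
# column-name filter answers the request (one set of replicas owns the row).
wide = stuff.get('set_20140101', columns=wanted)

# Skinny rows: the set is bucketed across many rows, so the same request
# becomes a multiget that fans out more evenly across the cluster.
def bucket(key, buckets=1000):
    return (zlib.crc32(key.encode('utf-8')) & 0xffffffff) % buckets

row_keys = list(set('set_20140101:%04d' % bucket(k) for k in wanted))
skinny = stuff.multiget(row_keys, columns=wanted)
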
I would suggest analyzing your data access pattern and finalizing your data model based on that, rather than the other way around.
