Understanding Cassandra's storage overhead


Question

I have been reading this section of the Cassandra docs and found the following a little puzzling:


Determine column overhead:

regular_total_column_size = column_name_size + column_value_size + 15

counter - expiring_total_column_size = column_name_size + column_value_size + 23

Every column in Cassandra incurs 15 bytes of overhead. Since each row in a table can have different column names as well as differing numbers of columns, metadata is stored for each column. For counter columns and expiring columns, you should add an additional 8 bytes (23 bytes total).
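As a quick worked example of the two formulas quoted above, here is a small Python sketch. The function name and the value sizes used in the calls (8 bytes for a timestamp, 4 bytes for an int) are my own assumptions for illustration; only the constants 15 and 23 come from the quoted docs.

```python
# Per-column overhead constants as quoted from the Cassandra docs above.
REGULAR_OVERHEAD = 15
COUNTER_OR_EXPIRING_OVERHEAD = 23


def total_column_size(name, value_size, counter_or_expiring=False):
    """Apply the documented formula: name size + value size + overhead.

    `name` is the column name as a string; `value_size` is the size of the
    value in bytes (supplied by the caller, an assumption of this sketch).
    """
    overhead = COUNTER_OR_EXPIRING_OVERHEAD if counter_or_expiring else REGULAR_OVERHEAD
    return len(name.encode("utf-8")) + value_size + overhead


# Assumed value sizes: timestamp = 8 bytes, int = 4 bytes.
print(total_column_size("report_date", 8))  # 11 + 8 + 15
print(total_column_size("x", 4))            # 1 + 4 + 15
```

This only restates the documented estimate; it says nothing yet about whether the name is really stored per row, which is the actual question.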

The way I interpret the above for a CQL3 defined schema such as:

CREATE TABLE mykeyspace.mytable(
  id text,
  report_id text,
  subset_id text,
  report_date timestamp,
  start_date timestamp,
  end_date timestamp,
  subset_descr text,
  x int,
  y double,
  z int,
  PRIMARY KEY (id, report_id, subset_id)
);

is that each row will contain the metadata for the column names, e.g., the strings report_date, start_date, end_date, etc. and their type along with the data. However, it's not clear to me what it means that each row in a table can have different column names. This sounds wrong to me given the schema above is totally static, i.e., Cassandra 2.0 will most certainly complain if I try to write:

INSERT INTO mykeyspace.mytable (id, report_id , subset_id, x, y, z, w) 
VALUES ( 'asd','qwe','rty',100,1.234,12, 123.123);

Bad Request: Unknown identifier w

Now it looks to me like column names are fixed given this table schema and thus the metadata should not need to be stored per each row. I am guessing either the phrasing in the documentation is outdated (it's the same as Cassandra 1.2) or I'm misunderstanding some core concept at work here.

Can anybody clarify? Bottom line: do I have to worry about the length of the names of my columns or not?

We have been playing it safe and have used single-character names where possible (so the above columns would actually be i, r, s, dr, ds, de, sd, ...), but this is hard for humans to read and can be confusing to work with.

Recommended answer

The easiest way to figure out what is going on in situations like this is to check the sstable2json (cassandra/bin) representation of your data. This will show you what actually ends up being saved on disk.

Here is an example for your case:

 [
 {"key": "4b6579","columns": [
       ["rid1:ssid1:","",1401469033325000],
       ["rid1:ssid1:end_date","2004-10-03 00:00:00-0700",1401469033325000],
       ["rid1:ssid1:report_date","2004-10-03 00:00:00-0700",1401469033325000],
       ["rid1:ssid1:start_date","2004-10-03 00:00:00-0700",1401469033325000], 
       ["rid1:ssid1:subset_descr","descr",1401469033325000],
       ["rid1:ssid1:x","1",1401469033325000], 
       ["rid1:ssid1:y","5.5",1401469033325000],
       ["rid1:ssid1:z","1",1401469033325000],
       ["rid2:ssid2:","",1401469938599000],
       ["rid2:ssid2:end_date", "2004-10-03 00:00:00-0700",1401469938599000],
       ["rid2:ssid2:report_date","2004-10-03 00:00:00-0700",1401469938599000],
       ["rid2:ssid2:start_date","2004-10-03 00:00:00-0700",1401469938599000], 
       ["rid2:ssid2:subset_descr","descr",1401469938599000],
       ["rid2:ssid2:x","1",1401469938599000],
       ["rid2:ssid2:y","5.5",1401469938599000],
       ["rid2:ssid2:z","1",1401469938599000]
 ]}
 ]

The value of the partition key is saved once per partition (per sstable), as you can see above. The name of the partition key column doesn't matter at all, since it is implicit given the table. The column names of the clustering columns are not present either, because C* doesn't allow you to insert without specifying all portions of the key.
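To make that layout concrete, here is a rough Python sketch of how one CQL row appears to map onto the cells in the sstable2json output above. The function `row_to_cells` is hypothetical and only mimics the JSON representation (clustering values joined with ":", followed by the column name); it is not Cassandra's real binary encoding. The empty-named first cell mirrors the `"rid1:ssid1:"` entry in the dump.

```python
def row_to_cells(clustering_values, regular_columns, timestamp):
    """Mimic the sstable2json view of one CQL row (illustrative only).

    clustering_values: values of the clustering columns, in key order.
    regular_columns:   dict of non-key column name -> value.
    """
    prefix = ":".join(clustering_values) + ":"
    # First cell: clustering prefix with an empty column name and empty value,
    # as seen in the dump above.
    cells = [(prefix, "", timestamp)]
    # One cell per non-key column; the column NAME is part of every cell name.
    for name in sorted(regular_columns):
        cells.append((prefix + name, str(regular_columns[name]), timestamp))
    return cells


cells = row_to_cells(["rid1", "ssid1"], {"x": 1, "y": 5.5, "z": 1}, 1401469033325000)
for cell in cells:
    print(cell)
```

Note that the names `report_id` and `subset_id` never show up: only their values do, while each regular column name is repeated in every row.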

What's left, though, does include the column name. This is needed in case a partial update to a row is made, so that the update can be saved without the rest of the row's information. Imagine an update to a single column of a row: to indicate which field it is, C* currently uses the column name, but there are tickets to change this to a smaller representation. https://issues.apache.org/jira/browse/CASSANDRA-4175
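So for the bottom-line question: the names of the non-key columns are repeated in every row, and their length does count. A back-of-the-envelope Python sketch of the difference for this table follows. It counts only the characters of the column names as they appear in the JSON dump, and ignores the real binary encoding and any SSTable compression, so treat the numbers as a rough upper bound rather than a measurement. The short names are the ones from the question.

```python
# Non-key columns of mykeyspace.mytable, long and abbreviated (from the question).
LONG_NAMES = ["report_date", "start_date", "end_date", "subset_descr", "x", "y", "z"]
SHORT_NAMES = ["dr", "ds", "de", "sd", "x", "y", "z"]


def name_bytes_per_row(names):
    """Bytes spent on column names in one fully populated CQL row."""
    return sum(len(n.encode("utf-8")) for n in names)


long_bytes = name_bytes_per_row(LONG_NAMES)
short_bytes = name_bytes_per_row(SHORT_NAMES)
saved = long_bytes - short_bytes

print("long names: ", long_bytes, "bytes per row")
print("short names:", short_bytes, "bytes per row")
print("difference: ", saved, "bytes per row")
```

Whether that difference justifies unreadable names depends on your row count and value sizes; the key columns (id, report_id, subset_id) can keep readable names at no cost, since their names are not stored per row.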

To generate this:

cqlsh
CREATE TABLE mykeyspace.mytable(   id text,   report_id text,   subset_id text,   report_date timestamp,   start_date timestamp,   end_date timestamp,   subset_descr text,   x int,   y double,   z int,   PRIMARY KEY (id, report_id, subset_id) );
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid1','ssid1', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
INSERT INTO mykeyspace.mytable (id, report_id , subset_id , report_date , start_date , end_date , subset_descr ,x, y, z) VALUES ( 'Key', 'rid2','ssid2', '2004-10-03','2004-10-03','2004-10-03','descr',1,5.5,1);
exit;
nodetool flush
bin/sstable2json $DATA_DIR/mytable/mykeyspace-mytable-jb-1-Data.db 
