Cassandra storage internals


Question

I'm trying to understand what exactly happens internally at the storage engine level when a row (columns) is inserted into a CQL-style table.

CREATE TABLE log_date (
  userid bigint,
  time timeuuid,
  category text,
  subcategory text,
  itemid text,
  count int,
  price int,
  PRIMARY KEY ((userid), time)  -- #1: first variant
  PRIMARY KEY ((userid), time, category, subcategory, itemid, count, price)  -- #2: second variant (use one or the other)
);

Suppose that I have a table like the one above.

In case #1, a CQL row will generate 6 (or 5?) columns in storage.
In case #2, a CQL row will generate a single, heavily composite column in storage.

I'm wondering which is the more efficient way to store logs in Cassandra.
Please focus on the two situations given.
I don't need any real-time reads, just writes.

If you want to suggest other options, please keep the following in mind.
The reasons I chose Cassandra for storing logs are:

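  1. Linear scalability, and it is well suited to heavy writes.
  2. It has a schema in CQL. I really prefer having a schema.
  3. It seems to support Spark well enough. DataStax's cassandra-spark connector appears to be data-locality aware.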

Recommended answer

I'm trying to understand what exactly happens internally at the storage engine level when a row (columns) is inserted into a CQL-style table.

Let's say that I build tables with both of your PRIMARY KEYs and INSERT some data:

aploetz@cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date1;

 userid | time                                 | dateof(time)             | category | subcategory    | itemid            | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
   1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 |    Audio |     Headphones | 228-5-44343-344-5 |     1 |  4799
   1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 |    Books | Computer Books | 978-1-78398-912-6 |     1 |  2200
   1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 |    Books | Computer Books | 977-8-78998-466-4 |     1 |   599

(5 rows)
aploetz@cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date2;

 userid | time                                 | dateof(time)             | category | subcategory    | itemid            | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
   1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 |    Audio |     Headphones | 228-5-44343-344-5 |     1 |  4799
   1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 |    Books | Computer Books | 978-1-78398-912-6 |     1 |  2200
   1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 |    Books | Computer Books | 977-8-78998-466-4 |     1 |   599

(5 rows)

Looks pretty much the same via cqlsh. So let's have a look from the cassandra-cli and query all rows for userid 1002:

RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:category, value=426f6f6b73, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:itemid, value=3637382d322d34343339382d3331322d39, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:price, value=0000031e, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:subcategory, value=4e6f76656c73, timestamp=1431092900008568)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:category, value=417564696f, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:itemid, value=3232382d352d34343334332d3334342d35, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:price, value=000012bf, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:subcategory, value=4865616470686f6e6573, timestamp=1431092985326774)

Simple enough, right? We see userid 1002 as the RowKey, and our clustering column time as a column key. Following that are all of our columns for each column key (time). I believe your first instance generates 6 columns per CQL row, as I'm pretty sure that count includes the placeholder for the column key, which is there because your PRIMARY KEY could point to an empty value (as your 2nd example key does).
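
To check that the hex values in the cassandra-cli output really are the same data that cqlsh showed, they can be decoded by hand. The following is a minimal Python sketch, not part of the original answer. The helper names are hypothetical, and it assumes that text cells are UTF-8, that int cells are 4-byte big-endian signed integers, and that the cell timestamp is microseconds since the Unix epoch.

```python
from datetime import datetime, timezone


def decode_text(hex_value):
    """Decode a hex-encoded text cell value, assuming UTF-8."""
    return bytes.fromhex(hex_value).decode("utf-8")


def decode_int(hex_value):
    """Decode a hex-encoded int cell value, assuming big-endian signed."""
    return int.from_bytes(bytes.fromhex(hex_value), byteorder="big", signed=True)


def decode_write_time(micros):
    """Convert a cell timestamp (assumed microseconds since epoch) to UTC,
    truncated to whole seconds."""
    return datetime.fromtimestamp(micros // 1_000_000, tz=timezone.utc)


# Values copied from the first cell group of RowKey 1002 above.
print(decode_text("426f6f6b73"))             # category
print(decode_int("0000031e"))                # price
print(decode_write_time(1431092900008568))   # write time of the INSERT
```

The decoded category and price line up with the first cqlsh row for userid 1002, and the write time corresponds to the dateof(time) value once the -0500 offset is taken into account.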

But what about the 2nd version for userid 1002?

RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:Books:Novels:678-2-44398-312-9:1:798:, value=, timestamp=1431093011349994)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:Audio:Headphones:228-5-44343-344-5:1:4799:, value=, timestamp=1431093011360402)

Two columns are returned for RowKey 1002, one for each unique combination of our column (clustering) keys, each with an empty value (as mentioned above).
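
The difference between the two layouts can be modeled in a few lines. This is only an illustrative Python sketch of the cell naming seen in the cassandra-cli output above, not Cassandra's actual implementation. The function name storage_cells is hypothetical, and the rule it encodes (one empty marker cell plus one cell per non-key column, all prefixed by the clustering values) is inferred from that output.

```python
def storage_cells(row, partition_key, clustering):
    """Return the cell names one CQL row produces inside its partition.

    Models the layout shown by cassandra-cli above: every cell name starts
    with the clustering values joined by ':'; an empty marker cell comes
    first, followed by one cell per column outside the PRIMARY KEY.
    """
    prefix = ":".join(str(row[col]) for col in clustering)
    regular = sorted(
        col for col in row if col != partition_key and col not in clustering
    )
    return [prefix + ":"] + [prefix + ":" + col for col in regular]


row = {
    "userid": 1002,
    "time": "e2f67ec0-f588-11e4-ade7-21b264d4c94d",
    "category": "Books",
    "subcategory": "Novels",
    "itemid": "678-2-44398-312-9",
    "count": 1,
    "price": 798,
}

# PRIMARY KEY #1: only time is a clustering column.
layout1 = storage_cells(row, "userid", ["time"])
# PRIMARY KEY #2: every remaining column is a clustering column.
layout2 = storage_cells(
    row, "userid",
    ["time", "category", "subcategory", "itemid", "count", "price"],
)

print(len(layout1), "cells for #1")
print(len(layout2), "cell for #2:", layout2[0])
```

Under this model the first key yields six cells per CQL row and the second yields a single cell whose name carries all the values, which is what the two cassandra-cli listings show.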

So what does this all mean for you? Well, a few things:

  • This should tell you that PRIMARY KEYs in Cassandra ensure uniqueness. So if you decide that you need to update key values like category or subcategory (2nd example), you really can't unless you DELETE and recreate the row. From a logging perspective, though, that's probably OK.
  • Cassandra stores all data for a particular partition/row key (userid) together, sorted by the column (clustering) keys. If you are concerned about querying and sorting your data, it is important to understand that you have to query for a specific userid for the sort order to make any difference.
  • The biggest issue I see is that right now you are setting yourself up for unbounded column growth. A partition/row key can support a maximum of 2 billion columns, so your 2nd example, which stores one column per CQL row instead of six, will help you out the most there. If you think some of your userids might exceed that, you could implement a "date bucket" as an additional partition key (say, if you knew that a userid would never exceed 2 billion in a year, or whatever).
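
The "date bucket" idea from the last point can be sketched on the application side. The Python below is a minimal illustration under stated assumptions, not something from the original answer: the function names and the bucket format are hypothetical, and it assumes the table would gain an extra partition key column (for example a key shaped like ((userid, bucket), time)) that the writer fills in from the event time.

```python
from datetime import datetime, timezone


def date_bucket(event_time, granularity="year"):
    """Derive a bucket label from a timezone-aware event time.

    A coarser granularity means fewer, larger partitions; a finer one
    means more, smaller partitions.
    """
    utc = event_time.astimezone(timezone.utc)
    if granularity == "year":
        return f"{utc.year:04d}"
    if granularity == "month":
        return f"{utc.year:04d}-{utc.month:02d}"
    raise ValueError(f"unsupported granularity: {granularity}")


def partition_key(userid, event_time, granularity="year"):
    """Build the composite partition key (userid, bucket) for one log entry."""
    return (userid, date_bucket(event_time, granularity))


# The same user lands in a new partition once the bucket rolls over.
first = partition_key(1002, datetime(2015, 5, 8, 13, 48, 20, tzinfo=timezone.utc))
later = partition_key(1002, datetime(2016, 1, 2, 9, 0, 0, tzinfo=timezone.utc))
print(first, later)
```

The trade-off is that a read for one userid then has to name the bucket (or iterate over several buckets), which is acceptable for a write-heavy logging workload like the one described in the question.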

It looks to me like your 2nd option might be the better choice. But honestly, for what you're doing, either of them will probably work OK.
