Cassandra storage internals

This article looks at how Cassandra handles its internal storage; it should be a useful reference for anyone working through a similar problem.

Question

I'm trying to understand what exactly happens internally at the storage-engine level when a row (columns) is inserted into a CQL-style table.

CREATE TABLE log_date (
  userid bigint,
  time timeuuid,
  category text,
  subcategory text,
  itemid text,
  count int,
  price int,
  PRIMARY KEY ((userid), time)                                                -- #1
  PRIMARY KEY ((userid), time, category, subcategory, itemid, count, price)   -- #2
);

Suppose that I have a table like above.

In case of #1, a CQL row will generate 6 (or 5?) columns in storage.
In case of #2, a CQL row will generate a single, very composite column in storage.

I'm wondering which is the more effective way to store logs in Cassandra.
Please focus on the two given situations.
I don't need any real-time reads. Just writes.



If you want to suggest other options, please take the following into account.
The reasons I chose Cassandra for storing logs are:

  1. Linear scalability and good performance for heavy writes.
  2. It has a schema in CQL. I really prefer having a schema.
  3. It seems to support Spark well enough. DataStax's cassandra-spark connector seems to have data locality awareness.

Solution

I'm trying to understand what exactly happens internally at the storage-engine level when a row (columns) is inserted into a CQL-style table.

Let's say that I build tables with both of your PRIMARY KEYs, and INSERT some data:
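
A minimal sketch of what that setup might have looked like, for reference (the column definitions come from the question above, and the table names log_date1/log_date2 and keyspace stackoverflow2 are taken from the queries and cqlsh prompt below; using now() here just generates a fresh timeuuid rather than reproducing the exact values in the listing):

CREATE TABLE log_date1 (
  userid bigint,
  time timeuuid,
  category text,
  subcategory text,
  itemid text,
  count int,
  price int,
  PRIMARY KEY ((userid), time)
);

CREATE TABLE log_date2 (
  userid bigint,
  time timeuuid,
  category text,
  subcategory text,
  itemid text,
  count int,
  price int,
  PRIMARY KEY ((userid), time, category, subcategory, itemid, count, price)
);

-- one of the sample rows, written to both tables
INSERT INTO log_date1 (userid, time, category, subcategory, itemid, count, price)
VALUES (1002, now(), 'Books', 'Novels', '678-2-44398-312-9', 1, 798);

INSERT INTO log_date2 (userid, time, category, subcategory, itemid, count, price)
VALUES (1002, now(), 'Books', 'Novels', '678-2-44398-312-9', 1, 798);

Querying both tables back through cqlsh: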

aploetz@cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date1;

 userid | time                                 | dateof(time)             | category | subcategory    | itemid            | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
   1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 |    Audio |     Headphones | 228-5-44343-344-5 |     1 |  4799
   1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 |    Books | Computer Books | 978-1-78398-912-6 |     1 |  2200
   1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 |    Books | Computer Books | 977-8-78998-466-4 |     1 |   599

(5 rows)
aploetz@cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date2;

 userid | time                                 | dateof(time)             | category | subcategory    | itemid            | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
   1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 |    Audio |     Headphones | 228-5-44343-344-5 |     1 |  4799
   1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 |    Books | Computer Books | 978-1-78398-912-6 |     1 |  2200
   1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 |    Books |         Novels | 678-2-44398-312-9 |     1 |   798
   1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 |    Books | Computer Books | 977-8-78998-466-4 |     1 |   599

(5 rows)

Looks pretty much the same via cqlsh. So let's have a look from the cassandra-cli, and query all rows for userid 1002:
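
For anyone wanting to reproduce this, the cassandra-cli session that produces output like the listing below might look roughly like this (a sketch; cassandra-cli is the old Thrift-based shell removed in Cassandra 3.0, and list dumps every row in the table rather than filtering on a single userid):

$ bin/cassandra-cli -h localhost
[default@unknown] use stackoverflow2;
[default@stackoverflow2] list log_date1;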

RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:category, value=426f6f6b73, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:itemid, value=3637382d322d34343339382d3331322d39, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:price, value=0000031e, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:subcategory, value=4e6f76656c73, timestamp=1431092900008568)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:category, value=417564696f, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:itemid, value=3232382d352d34343334332d3334342d35, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:price, value=000012bf, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:subcategory, value=4865616470686f6e6573, timestamp=1431092985326774)

Simple enough, right? We see userid 1002 as the RowKey, and our clustering column of time as a column key. Following that are all of our columns for each column key (time). And I believe your first instance generates 6 columns, as I'm pretty sure that includes the placeholder for the column key, because your PRIMARY KEY could point to an empty value (as your 2nd example key does).

But what about the 2nd version for userid 1002?

RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:Books:Novels:678-2-44398-312-9:1:798:, value=, timestamp=1431093011349994)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:Audio:Headphones:228-5-44343-344-5:1:4799:, value=, timestamp=1431093011360402)

Two columns are returned for RowKey 1002, one for each unique combination of our column (clustering) keys, with an empty value (as mentioned above).

So what does this all mean for you? Well, a few things:

  • This should tell you that PRIMARY KEYs in Cassandra ensure uniqueness. So if you decide that you need to update key values like category or subcategory (2nd example), you really can't unless you DELETE and recreate the row. Although from a logging perspective, that's probably ok.
  • Cassandra stores all data for a particular partition/row key (userid) together, sorted by the column (clustering) keys. If you were concerned about querying and sorting your data, it would be important to understand that you would have to query for each specific userid for sort order to make any difference.
  • The biggest issue I see is that right now you are setting yourself up for unbounded column growth. Partition/row keys can support a maximum of 2 billion columns, so your 2nd example will help you out the most there. If you think some of your userids might exceed that, you could implement a "date bucket" as an additional partition key (say, if you knew that a userid would never exceed 2 billion columns in a year, or whatever); a sketch of that follows this list.
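
As an illustration of that last bullet, a sketch of such a "date bucket" variant might look like the following; the table name log_date_bucketed and the int column datebucket (e.g. 201505 for a monthly bucket) are hypothetical additions, not part of the original schema:

CREATE TABLE log_date_bucketed (
  userid bigint,
  datebucket int,
  time timeuuid,
  category text,
  subcategory text,
  itemid text,
  count int,
  price int,
  PRIMARY KEY ((userid, datebucket), time, category, subcategory, itemid, count, price)
);

With (userid, datebucket) as a composite partition key, a user's log is spread across one partition per bucket, which keeps any single partition well under the 2 billion column limit; the trade-off is that queries for a user also have to supply the bucket value.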

It looks to me like your 2nd option might be the better choice. But honestly for what you're doing, either of them will probably work ok.

That concludes this look at Cassandra's internal storage; hopefully the answer above is helpful.
