Cassandra help: Supporting fast queries using either part of composite key

Question

I'm new to Cassandra and am unclear on the best way to store my data to support my query needs. I want to be able to search my data based on either of the keys, or both. To illustrate, I will use this example table:

CREATE TABLE temperature (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id,event_time)
);

This works great for queries like these two:

SELECT event_time, temperature FROM temperature WHERE weatherstation_id='1234ABCD';

...because it goes directly to a single partition.

SELECT temperature FROM temperature WHERE weatherstation_id='1234ABCD' AND event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';

...because it's still going to a single partition and getting a slice of results from an ordered list.

However, what if I wanted to do something like this:

SELECT temperature FROM temperature WHERE event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';

If my understanding serves me right, wouldn't this be inefficient, since it would need to iterate over every partition? Not only that, but the results would then need to be re-sorted to get them back in time order.

What's the best design to get around this?

Recommended answer

Actually, your query:

SELECT temperature FROM temperature WHERE event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';

will fail to run as written. Cassandra must know which partition to look in for the data you're requesting, so in practice you must always specify the partition key. (Adding ALLOW FILTERING would make the query run, but only by scanning every partition, which is exactly the inefficiency you describe.)

To retrieve data efficiently for this query, you need to model your data around that query too:

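CREATE TABLE temperature_by_time (
    granularity timestamp,
    event_time timestamp,
    weatherstation_id text,
    temperature text,
    -- weatherstation_id is a clustering column so that stations reporting
    -- at the same instant do not overwrite each other's rows
    PRIMARY KEY (granularity, event_time, weatherstation_id)
);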

Here I added the field granularity. This field lets you control how wide your partitions get. A good rule of thumb is to keep each partition to at most around 10k-100k rows. Depending on how fast you write to this table, you can proceed in different ways. Examples:

  • You have 10 sensors
  • Each sensor takes 1 measurement per second

In this case you're going to write 10 measurements/second, i.e. 36k measurements/hour. A good granularity value is then something like yyyy-mm-dd HH:00:00, that is, you partition your data on an hour-by-hour basis:

INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:05:01', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:19:15', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:39:35', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:59:49', ...);

SELECT * FROM temperature_by_time WHERE granularity='2017-01-11 10:00:00';
SELECT * FROM temperature_by_time WHERE granularity='2017-01-11 10:00:00' AND event_time >= '2017-01-11 10:30:00' AND event_time < '2017-01-11 11:00:00';

That is, you "truncate" the event_time to the whole hour, and each query reads records from one hourly partition at a time.
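The truncation happens on the client before each write or read. A minimal sketch of it in Python, assuming the application already has the event time as a datetime; the function name hour_bucket is hypothetical and not part of any driver:

```python
from datetime import datetime


def hour_bucket(event_time: datetime) -> datetime:
    """Truncate event_time to the start of its hour (the granularity value)."""
    return event_time.replace(minute=0, second=0, microsecond=0)


# The third INSERT above: an event at 10:39:35 lands in the 10:00:00 partition.
event_time = datetime(2017, 1, 11, 10, 39, 35)
granularity = hour_bucket(event_time)
print(granularity)  # 2017-01-11 10:00:00
```

The same function gives the partition key to use in the WHERE clause when reading.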

  • You have 100 sensors
  • Each sensor takes 1 measurement per second

In this case you're going to write 100 measurements/second, i.e. 360k measurements/hour. Good granularity values are then something like yyyy-mm-dd HH:00:00, yyyy-mm-dd HH:15:00, yyyy-mm-dd HH:30:00, yyyy-mm-dd HH:45:00, that is, you partition your data by quarters of an hour:

INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:05:01', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:15:00', '2017-01-11 10:19:15', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:30:00', '2017-01-11 10:39:35', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:45:00', '2017-01-11 10:59:49', ...);

SELECT * FROM temperature_by_time WHERE granularity='2017-01-11 10:00:00';
SELECT * FROM temperature_by_time WHERE granularity='2017-01-11 10:30:00' AND event_time >= '2017-01-11 10:30:00' AND event_time < '2017-01-11 10:33:00';

That is, you "truncate" the event_time to the quarter of an hour, and each query reads records from one quarter-hour partition at a time.
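A time range that crosses a quarter-hour boundary touches more than one partition, so the client has to issue one query per granularity value. A rough sketch of how those values could be enumerated, assuming a bucket width that divides 60 minutes; bucket_start and buckets_for_range are hypothetical helper names:

```python
from datetime import datetime, timedelta
from typing import List


def bucket_start(event_time: datetime, width_minutes: int = 15) -> datetime:
    """Truncate event_time to the start of its bucket (width must divide 60)."""
    minute = event_time.minute - event_time.minute % width_minutes
    return event_time.replace(minute=minute, second=0, microsecond=0)


def buckets_for_range(start: datetime, end: datetime,
                      width_minutes: int = 15) -> List[datetime]:
    """List, in ascending order, every granularity value whose partition
    may hold rows with start <= event_time < end."""
    step = timedelta(minutes=width_minutes)
    current = bucket_start(start, width_minutes)
    buckets = []
    while current < end:
        buckets.append(current)
        current += step
    return buckets


start = datetime(2017, 1, 11, 10, 20, 0)
end = datetime(2017, 1, 11, 10, 50, 0)
# Printed only for illustration; a real client would bind these as parameters.
for granularity in buckets_for_range(start, end):
    print(f"SELECT * FROM temperature_by_time WHERE granularity='{granularity}' "
          f"AND event_time >= '{start}' AND event_time < '{end}';")
```

Because the buckets come out in ascending order and rows inside each partition are clustered by event_time, concatenating the per-partition results in this order should already be in time order, with no re-sorting.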

You already know how to proceed from here...
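For other write rates, the same arithmetic can be sketched in a few lines. This assumes the upper end of the rule of thumb (100k rows) as the limit and an arbitrary list of candidate bucket widths; both are illustrative choices, and the function names are hypothetical:

```python
def rows_per_partition(sensors: int, measures_per_second: int,
                       bucket_seconds: int) -> int:
    """Estimate how many rows one partition receives for a given bucket width."""
    return sensors * measures_per_second * bucket_seconds


def pick_bucket_seconds(sensors: int, measures_per_second: int,
                        max_rows: int = 100_000,
                        candidates=(3600, 900, 300, 60)) -> int:
    """Return the widest candidate bucket whose estimated row count fits max_rows.

    Falls back to the narrowest candidate if none fits.
    """
    for width in candidates:
        if rows_per_partition(sensors, measures_per_second, width) <= max_rows:
            return width
    return candidates[-1]


print(pick_bucket_seconds(10, 1))   # 3600 -> hourly partitions (36k rows each)
print(pick_bucket_seconds(100, 1))  # 900  -> quarter-hour partitions (90k rows each)
```

The two printed cases reproduce the examples above: 10 sensors fit in hourly partitions, while 100 sensors need quarter-hour ones.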
