Cassandra help: Supporting fast queries using either part of a composite key


Question

I'm new to Cassandra and am unclear on the best way to store my data to support my query needs. I want to be able to search my data by either of the keys, or by both. To illustrate, I will use this example table:

CREATE TABLE temperature (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id,event_time)
);

This works well for queries like these two:

SELECT event_time, temperature FROM temperature WHERE weatherstation_id='1234ABCD';

...because it goes directly to a single partition.

SELECT temperature FROM temperature WHERE weatherstation_id='1234ABCD' AND event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';

...because it still goes to a single partition and reads a slice of results from an ordered list.

However, what if I wanted to do something like this:

SELECT temperature FROM temperature WHERE event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';

If I understand correctly, wouldn't this be inefficient, since it would need to iterate over every partition? Not only that, but the results would then need to be re-sorted to get them back into time order.

What's the best design to get around this?

Recommended answer

Actually, your query:

SELECT temperature FROM temperature WHERE event_time > '2013-04-03 07:01:00' AND event_time < '2013-04-03 07:04:00';

will fail to run as written. Cassandra must know which partition to look in for the data you're requesting, so in practice you must always specify the partition key. (Cassandra accepts a query like this only if you append ALLOW FILTERING, which scans across partitions and is generally not what you want.)

To retrieve data efficiently for this query, you need to model your data around that query too:

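CREATE TABLE temperature_by_time (
    granularity timestamp,
    event_time timestamp,
    weatherstation_id text,
    temperature text,
    -- weatherstation_id is included as a clustering column so that readings taken
    -- by different stations at the same event_time do not overwrite each other
    PRIMARY KEY (granularity, event_time, weatherstation_id)
);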

Here I added the granularity field. It lets you control how wide your partitions get. A good rule of thumb is to keep each partition to at most around 10k-100k rows. How you choose its value depends on how fast you write to this table. Examples:


  • You have 10 sensors

  • Each sensor takes one measurement per second

In this case you're going to write 10 measurements/second, or 36k measurements/hour. A good granularity value is then something like yyyy-mm-dd HH:00:00, that is, you partition your data on an hour-by-hour basis:

INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:05:01', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:19:15', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:39:35', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:59:49', ...);

SELECT * FROM temperature_by_time WHERE granularity='2017-01-11 10:00:00';
SELECT * FROM temperature_by_time WHERE granularity='2017-01-11 10:00:00' AND event_time >= '2017-01-11 10:30:00' AND event_time < '2017-01-11 11:00:00';

That is, you "truncate" the event_time to the whole hour and use the result as the partition key, so each query reads records from one hourly partition at a time.
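The truncation step happens on the client before each INSERT or SELECT. A minimal sketch in Python follows; the helper name `bucket_start` and its `bucket_minutes` parameter are illustrative assumptions and not part of the original answer or of any Cassandra driver.

```python
from datetime import datetime


def bucket_start(event_time: datetime, bucket_minutes: int = 60) -> datetime:
    """Round event_time down to the start of its bucket.

    Hypothetical helper: assumes bucket_minutes divides 60 evenly
    (for example 60, 30 or 15), so buckets align with the top of the hour.
    """
    if bucket_minutes <= 0 or 60 % bucket_minutes != 0:
        raise ValueError("bucket_minutes must be a positive divisor of 60")
    floored_minute = event_time.minute - event_time.minute % bucket_minutes
    return event_time.replace(minute=floored_minute, second=0, microsecond=0)


# The value to store in (and query by) the granularity column.
event_time = datetime(2017, 1, 11, 10, 39, 35)
print(bucket_start(event_time))  # 2017-01-11 10:00:00
```

The same function covers narrower buckets by passing a smaller `bucket_minutes`, which keeps the write path and the read path in agreement about where a row lives.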


  • You have 100 sensors

  • Each sensor provides one measurement per second

In this case you're going to write 100 measurements/second, or 360k measurements/hour. Good granularity values are then something like yyyy-mm-dd HH:00:00, yyyy-mm-dd HH:15:00, yyyy-mm-dd HH:30:00 and yyyy-mm-dd HH:45:00, that is, you partition your data by quarter of an hour (about 90k rows per partition):

INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:00:00', '2017-01-11 10:05:01', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:15:00', '2017-01-11 10:19:15', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:30:00', '2017-01-11 10:39:35', ...);
INSERT INTO temperature_by_time (granularity, event_time, ..) VALUES ('2017-01-11 10:45:00', '2017-01-11 10:59:49', ...);

SELECT * FROM temperature_by_time WHERE granularity='2017-01-11 10:00:00';
SELECT * FROM temperature_by_time WHERE granularity='2017-01-11 10:30:00' AND event_time >= '2017-01-11 10:30:00' AND event_time < '2017-01-11 10:33:00';

That is, you "truncate" the event_time to the start of its quarter of an hour.

You already know how to proceed...
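For other write rates, the same reasoning can be checked with a little arithmetic, and a time range that crosses bucket boundaries has to be split into one query per partition. The sketch below illustrates both under stated assumptions: `rows_per_partition` and `buckets_between` are hypothetical helper names, and bucket widths are assumed to divide 60 minutes evenly.

```python
from datetime import datetime, timedelta


def rows_per_partition(sensors: int, readings_per_second: int, bucket_minutes: int) -> int:
    """Estimate how many rows land in one partition of the given width."""
    return sensors * readings_per_second * bucket_minutes * 60


def buckets_between(start: datetime, end: datetime, bucket_minutes: int) -> list:
    """List the granularity value of every bucket overlapping [start, end)."""
    if bucket_minutes <= 0 or 60 % bucket_minutes != 0:
        raise ValueError("bucket_minutes must be a positive divisor of 60")
    if start >= end:
        return []
    # Round the range start down to the beginning of its bucket.
    current = start.replace(
        minute=start.minute - start.minute % bucket_minutes,
        second=0,
        microsecond=0,
    )
    step = timedelta(minutes=bucket_minutes)
    buckets = []
    while current < end:
        buckets.append(current)
        current += step
    return buckets


# Sizing check against the 10k-100k rows rule of thumb.
print(rows_per_partition(10, 1, 60))    # 36000
print(rows_per_partition(100, 1, 15))   # 90000

# A range query from 10:20 to 11:05 touches four quarter-hour partitions.
for granularity in buckets_between(
    datetime(2017, 1, 11, 10, 20), datetime(2017, 1, 11, 11, 5), 15
):
    print(granularity)
```

Each value returned by `buckets_between` would be used as the `granularity` in its own SELECT, with the original `event_time` bounds applied inside that partition; because the buckets are produced in time order, concatenating the per-partition results keeps them sorted by time.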
