Cassandra partition key for time series data

Problem description

I'm testing Cassandra as a time series database.

I created the data model below:

CREATE KEYSPACE sm WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 1
};

USE sm;

CREATE TABLE newdata (timestamp timestamp,
  deviceid int, tagid int,
  decvalue decimal,
  alphavalue text,
  PRIMARY KEY (deviceid,tagid,timestamp));

In the primary key, I set deviceid as the partition key, which means all data with the same device id will be written to one node. (Does that mean one machine or one partition? Each partition can have max 2 billion rows.) Also, if I query data within the same node, the retrieval will be fast, am I correct? I'm new to Cassandra and a bit confused about the partition key and clustering key.

Most of my queries will be as below:

  • select latest timestamp of known deviceid and tagid
  • select decvalue of known deviceid and tagid and timestamp
  • select alphavalue of known deviceid and tagid and timestamp
  • select * of known deviceid and tagid with time range
  • select * of known deviceid with time range

I will have around 2000 deviceids, and each deviceid will have 60 tagid/value pairs. I'm not sure whether this will end up as one wide row per deviceid: timestamp, tagid/value, tagid/value, ...

Recommended answer


I’m new to Cassandra and a bit confused about the partition key and clustering key.

It sounds like you understand partition keys, so I'll just add that your partition key helps Cassandra figure out where (which token range) in the cluster to store your data. Each node is responsible for several primary token ranges (assuming vnodes). When your data is written to a data partition, it is sorted by your clustering keys. This is also how it is stored on-disk, so remember that your clustering keys determine the order in which your data is stored on disk.
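
As a rough illustration of the two roles, here is a toy Python model. It is only a sketch under stated assumptions: `toy_token` is a stand-in hash (Cassandra's default partitioner is Murmur3, which is not reproduced here), the four-node `RING` is hypothetical and gives each node a single token range rather than the several ranges that vnodes would, and `write` ignores upsert semantics. It shows that the partition key alone decides where a row lives, while the clustering key (tagid, timestamp) decides the order of rows inside the partition.

```python
import hashlib
from bisect import bisect_left, insort

# Hypothetical ring: each node owns the tokens up to and including its own token.
RING = [(2**62, "node-a"), (2**63, "node-b"), (3 * 2**62, "node-c"), (2**64, "node-d")]


def toy_token(partition_key: int) -> int:
    """Stand-in partitioner (not Cassandra's Murmur3): key -> token in [0, 2**64)."""
    digest = hashlib.md5(str(partition_key).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(partition_key: int) -> str:
    """Return the node whose token range contains the key's token."""
    tokens = [token for token, _ in RING]
    return RING[bisect_left(tokens, toy_token(partition_key))][1]


# One sorted list per partition, ordered by the clustering key (tagid, timestamp).
partitions = {}


def write(deviceid: int, tagid: int, timestamp: str, decvalue: float) -> None:
    """Insert a row so the partition stays sorted by (tagid, timestamp)."""
    insort(partitions.setdefault(deviceid, []), (tagid, timestamp, decvalue))


# Rows arrive out of order but are kept in clustering order.
write(7, 2, "2016-03-01T00:00", 1.5)
write(7, 1, "2016-03-02T00:00", 2.5)
write(7, 1, "2016-03-01T00:00", 3.5)

print(owner(7))       # the same node every time for deviceid 7
print(partitions[7])  # rows ordered by tagid, then timestamp
```

Every row for deviceid 7 lands in the same partition, and within it the rows sit in clustering-key order regardless of arrival order.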


Each partition can have max 2 billion rows

That's not exactly true. Each partition can support up to 2 billion cells. A cell is essentially a column name/value pair. And your clustering keys add up to a single cell by themselves. So compute your cells by counting the column values that you store for each CQL row, and add one more if you use clustering columns.
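
The counting rule above can be turned into a quick back-of-the-envelope calculation. The Python sketch below applies it to the newdata table, which stores two value columns (decvalue and alphavalue) per CQL row. The sampling interval of one reading per tag every 10 seconds is purely an assumption for illustration, since the question does not state a rate, and the function names are hypothetical.

```python
CELL_LIMIT = 2_000_000_000  # per-partition cell limit cited in the answer


def cells_per_row(value_columns: int, has_clustering: bool = True) -> int:
    """Cells for one CQL row: one per stored value, plus one for the clustering keys."""
    return value_columns + (1 if has_clustering else 0)


def max_rows(value_columns: int, has_clustering: bool = True) -> int:
    """Upper bound on CQL rows in one partition under the cell limit."""
    return CELL_LIMIT // cells_per_row(value_columns, has_clustering)


def days_until_limit(tags: int, seconds_between_samples: int, value_columns: int) -> float:
    """Days of data one device partition can hold before reaching the cell limit."""
    rows_per_day = tags * 86_400 // seconds_between_samples
    return CELL_LIMIT / (rows_per_day * cells_per_row(value_columns))


# newdata stores decvalue and alphavalue, so 2 value columns + 1 = 3 cells per row.
print(max_rows(value_columns=2))

# Assumed workload: 60 tags per device, each sampled every 10 seconds.
print(days_until_limit(tags=60, seconds_between_samples=10, value_columns=2))
```

Under these assumptions a single-device partition would hold roughly 667 million rows at most, and the practical ceiling is reached much earlier, as the next paragraph explains.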

Depending on your wide row structure, you will probably hit a practical limit of far fewer than 2 billion rows. Additionally, that's just the storage limitation. Even if you managed to store 1 million CQL rows in a single partition, querying that partition would return so much data that it would be unwieldy and would probably time out.


if I query data within the same node, the retrieval will be fast, am I correct?

It'll at least be faster than multi-key queries that hit multiple nodes. But whether or not it will be "fast" depends on other things, like how wide your rows are, and how often you do things like deletes and in-place updates.


Most of my queries will be as below:

select latest timestamp of known deviceid and tagid
select decvalue of known deviceid and tagid and timestamp
select alphavalue of known deviceid and tagid and timestamp
select * of known deviceid and tagid with time range
select * of known deviceid with time range


Your current data model can support all of those queries except the last one. Because tagid comes before timestamp in the clustering key, a range restriction on timestamp cannot be applied unless tagid is restricted as well. To perform a range query on timestamp for a whole device, you'll need to duplicate your data into a new table, and build a PRIMARY KEY to support that query pattern. This is called "query-based modeling." I would build a query table like this:

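CREATE TABLE newdata_by_deviceid_and_time (
  timestamp timestamp,
  deviceid int,
  tagid int,
  decvalue decimal,
  alphavalue text,
  -- tagid is a trailing clustering column so that readings for different
  -- tags at the same timestamp do not overwrite each other
  PRIMARY KEY (deviceid, timestamp, tagid));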

That table can support a range query on timestamp, while partitioning on deviceid.
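
To see why the clustering order matters for this query, here is a small Python sketch. It is an in-memory analogy, not Cassandra's storage engine, and the sample rows and the `time_slice` helper are made up for illustration. When one device's rows are sorted by timestamp first, a time range is one contiguous slice; when they are sorted by tagid first, the matching rows are scattered across the partition.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical readings for one device: (timestamp, tagid, decvalue).
rows = [
    (datetime(2016, 3, 1, hour), tagid, float(hour))
    for tagid in (1, 2)
    for hour in range(4)
]

by_time_then_tag = sorted(rows)                             # clustering on timestamp first
by_tag_then_time = sorted(rows, key=lambda r: (r[1], r[0])) # clustering on tagid first


def time_slice(sorted_rows, start, end):
    """Rows with start <= timestamp < end, assuming rows are sorted by timestamp."""
    keys = [row[0] for row in sorted_rows]
    return sorted_rows[bisect_left(keys, start):bisect_left(keys, end)]


start, end = datetime(2016, 3, 1, 1), datetime(2016, 3, 1, 3)

# Timestamp-first ordering: the range is a single contiguous slice.
print(time_slice(by_time_then_tag, start, end))

# Tagid-first ordering: positions of matching rows are not adjacent.
print([i for i, row in enumerate(by_tag_then_time) if start <= row[0] < end])
```

The second listing is the reason the original table can only serve a time range once tagid is fixed.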

But the biggest problem I see with either of these models is that of "unbounded row growth." Basically, as you collect more and more values for your devices, you will approach the 2 billion cell limit per partition (and again, things will probably get slow well before that). What you need to do is use a modeling technique called "time bucketing."

For this example, I'll say that I determined that bucketing by month would keep me well under the 2 billion cell limit and allow for the type of date range flexibility that I needed. If so, I would add an additional partition key, monthbucket, and my (new) table would look like this:

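CREATE TABLE newdata_by_deviceid_and_time (
  timestamp timestamp,
  deviceid int,
  tagid int,
  decvalue decimal,
  alphavalue text,
  monthbucket text,
  -- tagid is a trailing clustering column so that readings for different
  -- tags at the same timestamp do not overwrite each other
  PRIMARY KEY ((deviceid, monthbucket), timestamp, tagid));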

Now when I want to query for data for a specific device and date range, I also specify the monthbucket:

SELECT * FROM newdata_by_deviceid_and_time
WHERE deviceid=23 AND monthbucket='201603'
AND timestamp >= '2016-03-01 00:00:00-0500'
AND timestamp < '2016-03-16 00:00:00-0500';

Remember, monthbucket is just an example. For you, it may make more sense to use quarter or even year (assuming that you don't store too many values per deviceid in a year).
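
With time bucketing, the application has to derive the bucket value itself, both when writing a row and when deciding which partitions a date range touches. The Python helpers below are one possible sketch, assuming the 'YYYYMM' text format used in the example query; the function names are hypothetical. A range that spans several months would need one query per bucket returned by `buckets_between`.

```python
from datetime import datetime


def month_bucket(ts: datetime) -> str:
    """Bucket value for a timestamp in the assumed 'YYYYMM' format."""
    return ts.strftime("%Y%m")


def buckets_between(start: datetime, end: datetime) -> list:
    """Month buckets from start's month through end's month, inclusive."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


# Bucket to store alongside a reading taken on 15 March 2016.
print(month_bucket(datetime(2016, 3, 15)))

# Partitions to query for a range that crosses a year boundary.
print(buckets_between(datetime(2015, 11, 20), datetime(2016, 2, 3)))
```

Switching to quarterly or yearly buckets would only change how the bucket string is derived; the table layout stays the same.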
