Time series modelling (with start & end date) in Cassandra


Question

I am doing time series data modelling where I have a start date and end date of events. I need to query on that data model like the following:

Select * from tablename where startdate>'2012-08-09' and enddate<'2012-09-09'

I looked at a reference on the CQL WHERE clause, but I couldn't achieve this. Is there any way to do that? I can also change the data model or apply any CQL tweaks. I am using Cassandra 2.1.

Answer

I had to solve a similar problem in one of my former positions. This is one way in which you could accomplish this...

I need to make query on that data model like the following: Select * from tablename where startdate>'2012-08-09' and enddate<'2012-09-09'.

There are two modeling problems preventing this query from working. First, to run a range query, you need to restrict your query to a partition key. With time series data, the best idea is to create something called a time bucket. For this example I'll partition the data by month, with a partition key called monthbucket.

The other problem is that you can only run a range query on a single column/key value. This becomes problematic when you want to query by both a start and an end date. One solution is to store each row in the table twice, and create an additional clustering key to record whether the row is the beginning row or the end row. I'll just call this column beginend.

Given those notes, I'll create a table that looks like this:

CREATE TABLE events (
  monthBucket TEXT,
  eventDate TIMESTAMP,
  beginEnd TEXT,
  eventid UUID,
  eventName TEXT,
  PRIMARY KEY (monthBucket, eventDate, beginEnd, eventid))
WITH CLUSTERING ORDER BY (eventDate DESC, beginEnd ASC, eventid ASC);

  • With most time series implementations, you tend to care more about the most-recent data. To that end, I am clustering on eventDate in DESCending order.
  • Also, as you could have multiple events starting at the same time, you should also add an additional clustering key to ensure uniqueness (eventid in this case).
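
As a rough sketch of the "store each row twice" idea, the Python below builds the bind parameters for the begin ("B") and end ("E") rows of one event. The helper names `month_bucket` and `rows_for_event` are hypothetical, and it assumes each row is written to the bucket of its own date, so an event that crosses a month boundary ends up in two partitions. It does not talk to a cluster; the tuples would be bound to a prepared INSERT by whatever client you use.

```python
from datetime import datetime, timezone
from uuid import UUID

# CQL INSERT with bind markers; executing it is left to your client.
INSERT_CQL = (
    "INSERT INTO events (monthbucket, eventdate, beginend, eventid, eventname) "
    "VALUES (?, ?, ?, ?, ?)"
)


def month_bucket(ts):
    """Return the partition key for a timestamp, e.g. '201509'."""
    return ts.strftime("%Y%m")


def rows_for_event(event_id, name, begin, end):
    """Build one 'B' (begin) row and one 'E' (end) row for an event.

    Assumption: each row goes into the bucket of its own date.
    """
    if end < begin:
        raise ValueError("end must not be before begin")
    return [
        (month_bucket(begin), begin, "B", event_id, name),
        (month_bucket(end), end, "E", event_id, name),
    ]


rows = rows_for_event(
    UUID("a223ad16-2afd-4213-bee3-08a2c4dd63e6"),
    "Hobbit Day",
    datetime(2015, 9, 25, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2015, 9, 25, 23, 59, 59, tzinfo=timezone.utc),
)
for params in rows:
    print(INSERT_CQL, params)
```

Writing both rows for an event together keeps the begin and end copies from drifting apart, which matters because nothing in the table itself ties them to each other except eventid.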

After INSERTing some rows, let's just query by a partition key of September, 2015:

aploetz@cqlsh:stackoverflow> SELECT * FROM events WHERE monthbucket='201509';

 monthbucket | eventdate                | beginend | eventid                              | eventname
-------------+--------------------------+----------+--------------------------------------+------------------------
      201509 | 2015-09-25 23:59:59+0000 |        E | a223ad16-2afd-4213-bee3-08a2c4dd63e6 |             Hobbit Day
      201509 | 2015-09-25 00:00:00+0000 |        B | a223ad16-2afd-4213-bee3-08a2c4dd63e6 |             Hobbit Day
      201509 | 2015-09-24 23:59:59+0000 |        E | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 |       Cassandra Summit
      201509 | 2015-09-22 00:00:00+0000 |        B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 |       Cassandra Summit
      201509 | 2015-09-19 23:59:59+0000 |        E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
      201509 | 2015-09-19 00:00:00+0000 |        B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day

(6 rows)

Similar to your example, let's say that I want to query events between September 18th and September 24th:

aploetz@cqlsh:stackoverflow> SELECT * FROM events WHERE monthbucket='201509' AND eventdate > '2015-09-18' AND eventdate < '2015-09-24';

 monthbucket | eventdate                | beginend | eventid                              | eventname
-------------+--------------------------+----------+--------------------------------------+------------------------
      201509 | 2015-09-22 00:00:00+0000 |        B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 |       Cassandra Summit
      201509 | 2015-09-19 23:59:59+0000 |        E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
      201509 | 2015-09-19 00:00:00+0000 |        B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day

(3 rows)

As you can see, I should get three rows: A beginning and an end row for "Talk Like A Pirate Day" and a beginning row for the 2015 Cassandra Summit.

As with all data modeling approaches, there are trade-offs to be made. In this case, to model for querying on both dates, the trade-off is that you have to duplicate your rows. And of course, to be able to range query at all, you have to decide on a good partition key (monthbucket) that offers relevant data and the required query flexibility. In any case, give it a try and see if you can make it work for your use case.

Edit to answer questions:

  1. If I want to find all events between 25th Nov, 2015 and 25th Nov, 2016, how could that be possible?

That's where you'd need to figure out the best time bucket for your application. Think about your most-common queries, and model off of that. Now you don't want to store too much in a single partition (bucket), because that will kill your data distribution. So try to find a happy medium between query flexibility and data distribution.

In this particular case with monthBucket, you'd have to execute a query for each individual month. The application that I designed this solution for never looked at an entire year's worth of events at once. If that's a query pattern you need to support, then you'll need to make your time bucket a little bigger.
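
To make the "one query per month" point concrete, here is a small sketch that lists the buckets a date range touches. The function name `month_buckets` is made up for illustration, and it assumes the same 'YYYYMM' text format as the monthbucket values above.

```python
from datetime import date


def month_buckets(start, end):
    """List every 'YYYYMM' bucket touched by the inclusive range start..end."""
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}{month:02d}")
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


buckets = month_buckets(date(2015, 11, 25), date(2016, 11, 25))
print(len(buckets), buckets[0], buckets[-1])  # prints: 13 201511 201611
```

For the range in the question that is 13 partitions, so 13 separate range queries whose results the application then has to combine.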

  2. Is there any way to remove this duplicate row from the result set only?

Nope. Duplicates would need to be handled/ignored at the application level. Cassandra CQL does have a DISTINCT keyword, but it only functions on partition keys.
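
One possible way to handle the duplicates in the application is sketched below, under the assumption that the result rows are available as plain tuples (how you get them depends on your driver). The function `collapse_events` is hypothetical. It merges the begin and end rows of each event by eventid, using the three rows from the range query result shown earlier.

```python
from collections import defaultdict

# Rows as (eventdate, beginend, eventid, eventname), copied from the
# three-row range query result shown earlier.
rows = [
    ("2015-09-22 00:00:00+0000", "B", "9cd6a265-6c60-4537-9ea9-b57e7c152db9", "Cassandra Summit"),
    ("2015-09-19 23:59:59+0000", "E", "b9fe9668-cef2-464e-beb4-d4f985ef9c47", "Talk Like a Pirate Day"),
    ("2015-09-19 00:00:00+0000", "B", "b9fe9668-cef2-464e-beb4-d4f985ef9c47", "Talk Like a Pirate Day"),
]


def collapse_events(rows):
    """Merge the begin/end rows of each event into one dict per eventid."""
    events = defaultdict(dict)
    for eventdate, beginend, eventid, eventname in rows:
        event = events[eventid]
        event["name"] = eventname
        # 'B' fills the begin slot, 'E' the end slot; a missing slot means
        # that boundary fell outside the queried range.
        event["begin" if beginend == "B" else "end"] = eventdate
    return dict(events)


events = collapse_events(rows)
# Events whose begin and end rows both fell inside the queried range.
fully_inside = [e["name"] for e in events.values() if "begin" in e and "end" in e]
print(len(events), fully_inside)  # prints: 2 ['Talk Like a Pirate Day']
```

The same pass also answers the original "startdate > X and enddate < Y" question: only events with both a begin and an end row in the result lie entirely inside the range.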

  3. Can this type of merging be done at the Cassandra level?

No, Cassandra does not have a way to JOIN tables together. And application-side joins are possible, but don't perform well and are technically an anti-pattern.

Handling data on the application-side (whether joining or filtering) is typically not a good idea. But the key is moderation. If you query 20 events and have to ignore dupes for some of them, that's not too big of a deal. But querying 20,000,000 events and applying an application-side process at that volume is not going to scale well at all. Again, this is where you have to look at the options available, and decide what will work for your application.
