Time series modelling (with start & end date) in Cassandra


Question

I am doing time series data modelling where each event has a start date and an end date. I need to query that data model like the following:

Select * from tablename where startdate>'2012-08-09' and enddate<'2012-09-09'

I consulted a reference on the CQL WHERE clause, but I couldn't achieve this. Is there any way to do it? I can also change the data model or apply any CQL tweaks. I am using Cassandra 2.1.

Recommended answer

I had to solve a similar problem in one of my former positions. This is one way in which you could accomplish this...

"I need to query that data model like the following: Select * from tablename where startdate>'2012-08-09' and enddate<'2012-09-09'."

There are two modeling problems preventing this query from working. First, to run a range query, you need to restrict the query to a partition key. With time series data, the usual approach is to create something called a time bucket. For this example I'll partition the data by month, with a partition key called monthbucket.
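To make the time-bucket idea concrete, here is a minimal Python sketch of how an application might derive the monthbucket value from an event timestamp. The helper name month_bucket is hypothetical, and the 'YYYYMM' format is an assumption based on the '201509' value used in the queries in this answer. Naive datetimes are assumed to already be in UTC.

```python
from datetime import datetime, timezone


def month_bucket(ts: datetime) -> str:
    """Return a 'YYYYMM' partition key for a timestamp.

    Assumption: timezone-aware values are normalized to UTC first,
    and naive values are treated as already being in UTC.
    """
    if ts.tzinfo is not None:
        ts = ts.astimezone(timezone.utc)
    return ts.strftime("%Y%m")


# Example: an event on 19 September 2015 lands in the '201509' bucket.
print(month_bucket(datetime(2015, 9, 19, tzinfo=timezone.utc)))
```

Whatever format you pick, the same function should be used for both writes and reads so that queries always hit the partition the data was written to.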

The other problem is that you can only run a range query on a single column/key value. This becomes problematic when you want to query by both a start and an end date. One solution is to store each row in the table twice, and create an additional clustering key that records whether the row is the beginning row or the end row. I'll just call this column beginend.

Given those notes, I'll create a table that looks like this:

CREATE TABLE events (
  monthBucket TEXT,
  eventDate TIMESTAMP,
  beginEnd TEXT,
  eventid UUID,
  eventName TEXT,
  PRIMARY KEY (monthBucket, eventDate, beginEnd, eventid))
WITH CLUSTERING ORDER BY (eventDate DESC, beginEnd ASC, eventid ASC);

  • With most time series implementations, you tend to care more about the most recent data. To that end, I am clustering on eventDate in DESCending order.
  • Also, as you could have multiple events starting at the same time, you should add an additional clustering key to ensure uniqueness (eventid in this case).

After INSERTing some rows, let's just query by a partition key of September, 2015:

      aploetz@cqlsh:stackoverflow> SELECT * FROM events WHERE monthbucket='201509';
      
       monthbucket | eventdate                | beginend | eventid                              | eventname
      -------------+--------------------------+----------+--------------------------------------+------------------------
            201509 | 2015-09-25 23:59:59+0000 |        E | a223ad16-2afd-4213-bee3-08a2c4dd63e6 |             Hobbit Day
            201509 | 2015-09-25 00:00:00+0000 |        B | a223ad16-2afd-4213-bee3-08a2c4dd63e6 |             Hobbit Day
            201509 | 2015-09-24 23:59:59+0000 |        E | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 |       Cassandra Summit
            201509 | 2015-09-22 00:00:00+0000 |        B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 |       Cassandra Summit
            201509 | 2015-09-19 23:59:59+0000 |        E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
            201509 | 2015-09-19 00:00:00+0000 |        B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
      
      (6 rows)
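The six rows above pair up as one begin ('B') and one end ('E') row per event. As a rough sketch of the write side, the Python below expands a single event into those two rows. The names EventRow, rows_for_event and INSERT_CQL are hypothetical, and the 'YYYYMM' bucket format is assumed from the '201509' value in the query. The CQL string is only a parameter template for a prepared statement. No driver is called here.

```python
import uuid
from datetime import datetime, timezone
from typing import List, NamedTuple


class EventRow(NamedTuple):
    month_bucket: str
    event_date: datetime
    begin_end: str  # 'B' for the begin row, 'E' for the end row
    event_id: uuid.UUID
    event_name: str


# Parameter template matching the events table; bind one EventRow per execution.
INSERT_CQL = (
    "INSERT INTO events (monthBucket, eventDate, beginEnd, eventid, eventName) "
    "VALUES (?, ?, ?, ?, ?)"
)


def rows_for_event(event_id: uuid.UUID, name: str,
                   start: datetime, end: datetime) -> List[EventRow]:
    """Expand one event into its begin row and its end row."""
    if end < start:
        raise ValueError("end must not be before start")
    return [
        EventRow(start.strftime("%Y%m"), start, "B", event_id, name),
        EventRow(end.strftime("%Y%m"), end, "E", event_id, name),
    ]


rows = rows_for_event(
    uuid.UUID("b9fe9668-cef2-464e-beb4-d4f985ef9c47"),
    "Talk Like a Pirate Day",
    datetime(2015, 9, 19, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2015, 9, 19, 23, 59, 59, tzinfo=timezone.utc),
)
for row in rows:
    print(row.month_bucket, row.begin_end, row.event_name)
```

Note that an event crossing a month boundary ends up with its begin row and its end row in different buckets, so a reader has to query both partitions to see the whole event.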
      

Similar to your example, let's say that I want to query events between September 18th and September 24th:

      aploetz@cqlsh:stackoverflow> SELECT * FROM events WHERE monthbucket='201509' AND eventdate > '2015-09-18' AND eventdate < '2015-09-24';
      
       monthbucket | eventdate                | beginend | eventid                              | eventname
      -------------+--------------------------+----------+--------------------------------------+------------------------
            201509 | 2015-09-22 00:00:00+0000 |        B | 9cd6a265-6c60-4537-9ea9-b57e7c152db9 |       Cassandra Summit
            201509 | 2015-09-19 23:59:59+0000 |        E | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
            201509 | 2015-09-19 00:00:00+0000 |        B | b9fe9668-cef2-464e-beb4-d4f985ef9c47 | Talk Like a Pirate Day
      
      (3 rows)
      

As you can see, I get three rows: a beginning and an end row for "Talk Like a Pirate Day" and a beginning row for the 2015 Cassandra Summit.
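The range query returns rows, not events, so the application still has to decide which events satisfy the original "starts after X and ends before Y" condition. One possible reading, sketched below in Python, is that an event lies entirely inside the range only when both its 'B' and its 'E' row come back. The function names and the dict-shaped rows are assumptions for illustration, and the event ids are shortened.

```python
from collections import defaultdict


def markers_by_event(rows):
    """Map each eventid to the set of 'B'/'E' markers seen in the result rows."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["eventid"]].add(row["beginend"])
    return dict(seen)


def fully_contained(rows):
    """Return the ids of events whose begin and end rows both appear."""
    return [eid for eid, marks in markers_by_event(rows).items()
            if marks == {"B", "E"}]


# The three rows returned by the September 18th to 24th query.
rows = [
    {"eventid": "9cd6a265", "beginend": "B", "eventname": "Cassandra Summit"},
    {"eventid": "b9fe9668", "beginend": "E", "eventname": "Talk Like a Pirate Day"},
    {"eventid": "b9fe9668", "beginend": "B", "eventname": "Talk Like a Pirate Day"},
]
print(fully_contained(rows))  # ['b9fe9668']
```

Here only Talk Like a Pirate Day qualifies, because the end row of the Cassandra Summit falls outside the queried range. If a range spans several buckets, the rows from every bucket need to be merged before applying this check.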

As with all data modeling approaches, there are trade-offs to be made. In this case, the trade-off for being able to query on both dates is that you have to duplicate your rows. And of course, to be able to range query at all, you have to decide on a good partition key (monthbucket) that offers relevant data and the required query flexibility. In any case, give it a try and see if you can make it work for your use case.

Edits to answer questions:

"If I want to find all events between 25th Nov, 2015 and 25th Nov, 2016, how would that be possible?"

That's where you'd need to figure out the best time bucket for your application. Think about your most common queries, and model off of that. You don't want to store too much in a single partition (bucket), because that will kill your data distribution. So try to find a happy medium between query flexibility and data distribution.

In this particular case with monthBucket, you'd have to execute a query for each individual month. The application that I designed this solution for never looked at an entire year's worth of events at once. If that's a query pattern you need to support, then you'll need to make your time bucket a little bigger.
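As a rough illustration of "a query for each individual month", the Python sketch below lists the bucket keys an application would have to visit for a given date range. The function name month_buckets is hypothetical and the 'YYYYMM' key format is assumed from the '201509' value used earlier.

```python
from datetime import date
from typing import List


def month_buckets(start: date, end: date) -> List[str]:
    """List 'YYYYMM' buckets from start's month through end's month, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}{month:02d}")
        # Step to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


buckets = month_buckets(date(2015, 11, 25), date(2016, 11, 25))
print(len(buckets), buckets[0], buckets[-1])  # 13 201511 201611
```

For the range in the question that means thirteen separate partition reads, which is the cost being weighed against a larger bucket.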

"Is there any way to remove this duplicate row from the result set only?"

Nope. Duplicates would need to be handled/ignored at the application level. Cassandra CQL does have a DISTINCT keyword, but it only functions on partition keys.
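A minimal Python sketch of that application-level handling follows, assuming the result rows arrive as dicts keyed by column name. The function name unique_events is hypothetical and the event ids are shortened for readability.

```python
def unique_events(rows):
    """Collapse begin/end rows into one (eventid, eventname) pair per event.

    The first row seen for each eventid wins, and first-seen order is kept.
    """
    events = {}
    for row in rows:
        events.setdefault(row["eventid"], row["eventname"])
    return list(events.items())


rows = [
    {"eventid": "9cd6a265", "beginend": "B", "eventname": "Cassandra Summit"},
    {"eventid": "b9fe9668", "beginend": "E", "eventname": "Talk Like a Pirate Day"},
    {"eventid": "b9fe9668", "beginend": "B", "eventname": "Talk Like a Pirate Day"},
]
print(unique_events(rows))
```

This is a single pass over the result set, which is fine for small results but does not change the point about very large ones.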

"Can this type of merging be done at the Cassandra level?"

No, Cassandra does not have a way to JOIN tables together. Application-side joins are possible, but they don't perform well and are technically an anti-pattern.

Handling data on the application side (whether joining or filtering) is typically not a good idea. But the key is moderation. If you query 20 events and have to ignore dupes for some of them, that's not too big of a deal. But querying 20,000,000 events and applying an application-side process at that volume is not going to scale well at all. Again, this is where you have to look at the options available, and decide what will work for your application.
