Using Cassandra for time series data


Question

I'm researching how to store logs in Cassandra.
The schema for the logs would be something like this.

Edit: I've changed the schema to clarify things.

CREATE TABLE log_date (
  userid bigint,
  time timeuuid,
  reason text,
  item text,
  price int,
  count int,
  -- Two candidate keys; only one can be declared in the real table.
  PRIMARY KEY ((userid), time)                                -- option #1
  -- PRIMARY KEY ((userid), time, reason, item, price, count) -- option #2
);

A new table will be created every day, so each table contains the logs for only one day.

My query condition is as follows.
Query all logs from a specific user on a specific day (date, not time).
So reason, item, price and count will not be used as hints or conditions in queries at all.

My question is which PRIMARY KEY design suits this better.
Edit: the key point here is that I want to store the logs in a structured way, following a schema.

If I choose #1, many columns will be created per log, and it is very likely that each log will carry even more values. The schema above is just an example; a log can also contain values like subreason, friendid and so on.

If I choose #2, one (very) composite column will be created per log, and so far I couldn't find any useful information about the overhead of composite columns.

Which one should I choose? Please help.

Recommended answer

My advice is that neither of your two options seems ideal for your time series, and creating a table per day doesn't seem optimal either.

Instead, I'd recommend creating a single table partitioned by userid and day, with a timeuuid as the clustering column for the event. An example would look like:

CREATE TABLE log_per_day (
  userid bigint,
  date text,
  time timeuuid,
  value text,
  PRIMARY KEY ((userid, date), time)
);

This lets you keep all of a user's events for a day in a single row (partition) and run your query per day, per user.

Declaring time as the clustering column gives you a wide row into which you can insert as many events as you need in a day.
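To make the wide-row idea concrete, here is a minimal in-memory sketch of how the layout behaves. It is only a toy model, not Cassandra itself: the `partitions` dict, `insert_event`, `events_for`, and the integer `time_key` (standing in for a timeuuid) are all hypothetical names introduced for illustration.

```python
from bisect import insort
from collections import defaultdict

# Toy stand-in for log_per_day: partition key -> events kept in clustering order.
# (userid, date) plays the role of the composite partition key.
partitions = defaultdict(list)


def insert_event(userid, date, time_key, value):
    """Add an event to the (userid, date) partition, sorted by time_key."""
    insort(partitions[(userid, date)], (time_key, value))


def events_for(userid, date):
    """Return all events of one user on one day, ordered by time_key."""
    return list(partitions.get((userid, date), []))


# Events arrive out of order; the partition keeps them sorted.
insert_event(1000, "2015-05-06", 2, "my value2")
insert_event(1000, "2015-05-06", 1, "my value")
insert_event(1000, "2015-05-07", 1, "next day")

print(events_for(1000, "2015-05-06"))
```

Each (userid, date) pair maps to one growing list of events, which is roughly what "a wide row per user per day" means here.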

So the row key is a composite key made of the userid plus the date as text, e.g.:

INSERT INTO log_per_day (userid, date, time, value) VALUES (1000, '2015-05-06', aTimeUUID1, 'my value');

INSERT INTO log_per_day (userid, date, time, value) VALUES (1000, '2015-05-06', aTimeUUID2, 'my value2');

The two inserts above land in the same row, so you will be able to read them back with a single query.
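One practical detail is keeping the `date` bucket consistent with the `time` value. A possible approach, sketched below with only the Python standard library, is to derive the day string from the timeuuid itself. The helper names (`date_bucket`, `insert_params`), the use of UTC for the day boundary, and the `?` placeholder style are assumptions for illustration; the placeholder syntax depends on the driver you use.

```python
import uuid
from datetime import datetime, timezone

# Number of 100-ns intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000

# CQL kept as plain strings; a driver would prepare and execute these.
INSERT_CQL = "INSERT INTO log_per_day (userid, date, time, value) VALUES (?, ?, ?, ?)"
SELECT_CQL = "SELECT time, value FROM log_per_day WHERE userid = ? AND date = ?"


def date_bucket(time_uuid):
    """Return the UTC day ('YYYY-MM-DD') encoded in a version 1 UUID."""
    if time_uuid.version != 1:
        raise ValueError("expected a version 1 (time-based) UUID")
    unix_seconds = (time_uuid.time - UUID_EPOCH_OFFSET) // 10**7
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def insert_params(userid, time_uuid, value):
    """Build the bind values for INSERT_CQL, deriving the date from the UUID."""
    return (userid, date_bucket(time_uuid), time_uuid, value)


event_id = uuid.uuid1()  # time-based UUID for "now"
print(INSERT_CQL, insert_params(1000, event_id, "my value"))
print(SELECT_CQL, (1000, date_bucket(event_id)))
```

Reading a whole day for one user then only needs the two partition key values, which is what the SELECT string expresses.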

Also, if you want more information about time series, I highly recommend checking out Getting Started with Time Series Data Modeling.

Hope it helps,

José Luis
