Structuring Cassandra table for date queries


Problem Description

I'm learning Cassandra, and as a practice data set, I'm grabbing historical stock data from Yahoo. There is going to be one record for each trading day.

Obviously, I need to make the stock symbol part of the partition key. I'm seeing conflicting information on whether I should also make the date part of the partition key, or make it a clustering column.

Realistically, the stock market is open ~253 days per year, so a single stock will have ~253 records per year. I'm not building a full-scale database, but I would like to design it correctly.

If I make the date part of the partition key, won't that possibly be spread across nodes? Won't that make date range queries slow?

Solution

If I make the date part of the partition key, won't that possibly be spread across nodes? Won't that make date range queries slow?

Yes, correct on both counts. That modeling approach is called "time bucketing," and its primary use case is time/event data that grows over time. The good news is that you don't need to do that unless your partitions are projected to get big. With your current projection of 253 rows written per partition per year, that's only going to be < 40 KB each year (see the calculation with nodetool tablehistograms below).

For your purposes I think partitioning by symbol and clustering by day should suffice.

CREATE TABLE stockquotes (
 symbol text,
 day date,
 price decimal,
 PRIMARY KEY(symbol, day))
 WITH CLUSTERING ORDER BY (day DESC);

With most time-based use cases, we tend to care about recent data more (which may or may not be true in your case). If so, then writing the data in descending order by day will improve the performance of those queries.
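
For illustration, here is a minimal write sketch using the DataStax Python driver (cassandra-driver); the 127.0.0.1 contact point and the stackoverflow keyspace are assumptions, and the sample prices are taken from the result set shown below.

# Minimal write sketch against the stockquotes table above.
# Contact point and keyspace name are assumptions for illustration.
from datetime import date
from decimal import Decimal
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("stackoverflow")

insert = session.prepare(
    "INSERT INTO stockquotes (symbol, day, price) VALUES (?, ?, ?)")

for d, p in [(date(2020, 8, 7), "444.45"),
             (date(2020, 8, 6), "455.61"),
             (date(2020, 8, 5), "440.25")]:
    session.execute(insert, ("AAPL", d, Decimal(p)))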

Then (after writing some data), date range queries like this will work:

SELECT * FROM stockquotes 
WHERE symbol='AAPL'
  AND day >= '2020-08-01' AND day < '2020-08-08';

 symbol | day        | price
--------+------------+--------
   AAPL | 2020-08-07 | 444.45
   AAPL | 2020-08-06 | 455.61
   AAPL | 2020-08-05 | 440.25
   AAPL | 2020-08-04 | 438.66
   AAPL | 2020-08-03 | 435.75

(5 rows)

To verify the partition sizes, you can use nodetool tablehistograms (once the data is flushed to disk).

bin/nodetool tablehistograms stackoverflow.stockquotes
stackoverflow/stockquotes histograms
Percentile      Read Latency     Write Latency          SSTables    Partition Size        Cell Count
                    (micros)          (micros)                             (bytes)
50%                     0.00              0.00              0.00               124                 5
75%                     0.00              0.00              0.00               124                 5
95%                     0.00              0.00              0.00               124                 5
98%                     0.00              0.00              0.00               124                 5
99%                     0.00              0.00              0.00               124                 5
Min                     0.00              0.00              0.00               104                 5
Max                     0.00              0.00              0.00               124                 5

Partition size each year = 124 bytes x 253 = 31kb

Given the tiny partition size, this model would probably be good for at least 30 years of data before any slow-down (I recommend keeping partitions <= 1 MB). Perhaps bucketing on something like a quarter-century might suffice? Regardless, in the short term, it'll be fine.
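
To make that 30-year figure concrete, here is the back-of-the-envelope arithmetic as a tiny Python sketch, assuming the 124-byte-per-row size from the histogram above stays roughly constant:

# Rough partition-lifetime estimate from the numbers above.
bytes_per_row = 124                 # per-day row size reported by nodetool
trading_days_per_year = 253
partition_budget = 1024 * 1024      # suggested <= 1 MB per partition

bytes_per_year = bytes_per_row * trading_days_per_year   # 31,372 bytes (~31 KB)
years_in_budget = partition_budget / bytes_per_year       # ~33 years
print(bytes_per_year, round(years_in_budget, 1))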

Edit:

Seems like any date portion used in the PK would spread the data across nodes, no?

Yes, a date portion used in the partition key would spread the data across nodes. That's actually the point of doing it. You don't want to end up with the anti-pattern of unbounded row growth, because the partitions will eventually get so large that they'll be unusable. This idea is all about ensuring adequate data distribution.

Let's say 1/sec and I need to query across years, etc. How would that bucketing work?

So the trick with time bucketing is to find a "happy medium" between data distribution and query flexibility. Unfortunately, there will likely be edge cases where queries hit more than one partition (node). But the idea is to build a model that handles most of them well.

The example question here of 1/sec for a year is a bit extreme, but the idea to solve it is the same. There are 86,400 seconds in a day. Depending on row size, that may even be too much to bucket by day. But for the sake of argument, say we can. If we bucket by day, the PK looks like this:

PRIMARY KEY ((symbol, day), timestamp)

And the WHERE clause starts to look like this:

WHERE symbol='AAPL' AND day IN ('2020-08-06','2020-08-07');

On the flip side of that, a few days is fine but querying for an entire year would be cumbersome. Additionally, we wouldn't want to build an IN clause of 253 days. In fact, I don't recommend folks exceed single digits on an IN.

A possible approach here would be to fire 253 asynchronous queries (one for each day) from the application, and then assemble and sort the result set there. Using Spark (to do everything in an RDD) is a good option here, too. In reality, Cassandra isn't a great DB for a reporting API, so there is value in exploring some additional tools.
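
As a sketch of that async fan-out idea, the DataStax Python driver could be used roughly like this; the stockquotes_by_day table name is an assumption that simply follows the PRIMARY KEY ((symbol, day), timestamp) shape above, and the contact point/keyspace are placeholders:

# Fan out one async query per day bucket, then merge and sort client-side.
from datetime import date, timedelta
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("stackoverflow")

query = session.prepare(
    "SELECT * FROM stockquotes_by_day WHERE symbol = ? AND day = ?")

start = date(2020, 1, 1)
days = [start + timedelta(days=i) for i in range(253)]

futures = [session.execute_async(query, ("AAPL", d)) for d in days]  # non-blocking sends
rows = [row for f in futures for row in f.result()]                  # gather all rows
rows.sort(key=lambda r: r.timestamp, reverse=True)                   # newest first

In practice you would likely cap the number of in-flight requests rather than firing all 253 at once, but the fan-out/merge shape is the same.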
