Cassandra bucket splitting for partition sizing

Question

I am quite new to Cassandra. I have only learned it through the DataStax courses, and I can't find enough information about buckets here or elsewhere on the Internet. In my application I need to use buckets to split my data.

I have some instruments that will take a lot of measures, and splitting the measures by day (with the day's timestamp in the partition key) might be a bit risky, as we could easily reach the commonly recommended limit of 100MB for a partition. Each measure concerns a specific object identified by an ID. So I would like to use a bucket, but I don't know how to do it.

I'm using Cassandra 3.7.

Here is roughly what my table will look like:

CREATE TABLE measures (
  instrument_id bigint,
  day timestamp,
  bucket int,
  measure_timestamp timestamp,
  measure_id uuid,
  measure_info float,
  object_id bigint,
  PRIMARY KEY ((instrument_id, day, bucket), measure_timestamp, measure_id)
);

I thought of adding object_id to the partition key, but then I lose the "flow of measures" made by an instrument, and what interests me is seeing all the measures made by an instrument on a specific day or over a period of time.


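  • When I want to request all the records of a specific instrument for a specific day, how do I do that if there are many buckets?

  • If I want partitions limited to 400k rows, how do I know which bucket to insert the data into?

  • Is there a way to know how many buckets there are?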

Thanks a lot for your help!

Recommended answer

You should focus on your requirements first, and then go back to your schema model. In your case, how many measures per day can each instrument take? If each one takes fewer than your 400k measures, then you're already done, without bucketing. If your instruments can take up to 10M measures each per day, then N = 10M / 400k = 25 buckets should be enough to satisfy your requirements.

Assuming N buckets, when you need to query all the measures coming from a particular instrument, you have to perform N queries, one for each bucket, unless you can count the measures during your writes so that you can change bucket when a bucket is full. That is, you write the first 400k measures to bucket 0, the next 400k measures to bucket 1, and so on. Then you need to keep track of the number K of buckets you have inserted data into, and perform only K queries instead of N. That way you have unbalanced buckets (and partitions), but you get your results in the smallest number of queries.

If you prefer a balanced-bucket approach instead, you can send each write to a uniformly distributed random bucket number, but then you have to perform all N queries to get all the data of a specific instrument.
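
As a rough illustration of the answer above, here is a minimal Python sketch of the bucket arithmetic and the two assignment strategies. It is not tied to any driver: the function names are hypothetical, the running count of measures per instrument and day is assumed to be tracked by the writing application, and the SELECT statement is only built as a string (the %s placeholders are an assumption about how the client binds parameters).

```python
import math
import random

MAX_ROWS_PER_PARTITION = 400_000  # target cap taken from the question


def bucket_count(max_measures_per_day, cap=MAX_ROWS_PER_PARTITION):
    """N: buckets needed so no (instrument_id, day, bucket) partition exceeds cap."""
    return max(1, math.ceil(max_measures_per_day / cap))


def sequential_bucket(measures_written_so_far, cap=MAX_ROWS_PER_PARTITION):
    """Fill-in-order strategy: bucket 0 takes the first `cap` rows, bucket 1 the next.

    Assumes the writer keeps a running count per instrument and day.
    """
    return measures_written_so_far // cap


def buckets_in_use(measures_written, cap=MAX_ROWS_PER_PARTITION):
    """K: how many buckets hold data under the fill-in-order strategy."""
    return math.ceil(measures_written / cap)


def random_bucket(n_buckets):
    """Balanced strategy: spread writes uniformly over buckets 0..N-1."""
    return random.randrange(n_buckets)


# Table and column names come from the schema in the question.
SELECT_CQL = (
    "SELECT * FROM measures "
    "WHERE instrument_id = %s AND day = %s AND bucket = %s"
)


def day_queries(instrument_id, day, n_buckets):
    """One (statement, parameters) pair per bucket to read a full day."""
    return [(SELECT_CQL, (instrument_id, day, b)) for b in range(n_buckets)]
```

With the fill-in-order strategy a reader would call day_queries with K from buckets_in_use; with the random strategy it has to pass the full N from bucket_count. Results from the separate queries still need to be merged by measure_timestamp on the client if a single ordered stream is wanted.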
