Cassandra bucket splitting for partition sizing


Problem description

I am quite new to Cassandra; I have just learned it through the DataStax courses, but I can't find enough information on buckets here or on the Internet, and in my application I need to use buckets to split my data.

I have some instruments that will take quite a lot of measures, and splitting the measures daily (with the timestamp as partition key) might be a bit risky, as we could easily reach the 100 MB limit for a partition. Each measure concerns a specific object identified by an ID. So I would like to use a bucket, but I don't know how to do it.

I'm using Cassandra 3.7.

Here is roughly what my table will look like:

CREATE TABLE measures (
  instrument_id bigint,
  day timestamp,
  bucket int,
  measure_timestamp timestamp,
  measure_id uuid,
  measure_info float,
  object_id bigint,
  PRIMARY KEY ((instrument_id, day, bucket), measure_timestamp, measure_id)
);

I thought of adding object_id as a partition key, but then I lose the "flow of measures" made by an instrument, as what interests me is seeing all the measures made by an instrument on a specific day or over a period of time.

  • So the question is: when I want to request all the records of one day for a specific instrument, what do I do if there are many buckets?
  • If I want to limit a partition to 400,000 rows, how do I know at insert time which bucket I have to insert the data into?
  • Is there a way to know the number of buckets?

Thank you very much for your help!

Recommended answer

You should focus on your requirements and then go back to your schema model. In your case, how many measures per day can each instrument make? If each one makes fewer than your 400k measures, you're already done without bucketing. If your instruments can each perform up to 10M measures per day, then N = 10M / 400k = 25 buckets should be enough to satisfy your requirement.

Assuming N buckets, when you need to query all the measures coming from a particular instrument you have to perform N queries, one for each bucket, unless you count the measures during your writes so that you can switch buckets when the current one is full. That is, you write the first 400k measures to bucket 0, the next 400k measures to bucket 1, and so on. Then you only need to keep track of how many buckets K you have inserted data into and perform K queries instead of N. That way you have unbalanced buckets (and partitions), but you get your results with the smallest number of queries.

If you prefer a balanced-bucket approach instead, you can perform each write to a uniformly distributed random bucket number, but then you have to perform all N queries to get all the data for a specific instrument.
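As a minimal sketch of the balanced-bucket variant described above, the code below uses the DataStax Python driver (cassandra-driver) against the measures table from the question. The contact point, keyspace name, bucket count N_BUCKETS, and helper function names are assumptions for illustration, not something from the original question.

from cassandra.cluster import Cluster
import datetime
import random
import uuid

N_BUCKETS = 25  # assumption: ~10M measures/day per instrument / 400k rows per partition

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('measures_ks')  # assumed keyspace name

insert_stmt = session.prepare("""
    INSERT INTO measures (instrument_id, day, bucket,
                          measure_timestamp, measure_id, measure_info, object_id)
    VALUES (?, ?, ?, ?, ?, ?, ?)
""")

select_stmt = session.prepare("""
    SELECT * FROM measures
    WHERE instrument_id = ? AND day = ? AND bucket = ?
""")

def write_measure(instrument_id, ts, value, object_id):
    """Write one measure into a uniformly chosen bucket for that day."""
    day = datetime.datetime(ts.year, ts.month, ts.day)  # truncate timestamp to the day
    bucket = random.randrange(N_BUCKETS)                # balanced, but reads must scan all N buckets
    session.execute(insert_stmt,
                    (instrument_id, day, bucket, ts, uuid.uuid4(), value, object_id))

def read_day(instrument_id, day):
    """Fan out one query per bucket for (instrument, day) and merge the results."""
    futures = [session.execute_async(select_stmt, (instrument_id, day, b))
               for b in range(N_BUCKETS)]
    return [row for f in futures for row in f.result()]

For the unbalanced variant, you would replace the random bucket with a per-(instrument_id, day) counter divided by 400,000, so reads can stop at the K buckets actually filled; in both variants the measure_timestamp clustering key keeps rows inside each bucket ordered by time.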

