Cassandra bucket splitting for partition sizing

Question

I am quite new to Cassandra. I learned it through the DataStax courses, but I can't find enough information about buckets, either here or elsewhere on the Internet, and in my application I need to use buckets to split my data.

I have some instruments that will take quite a lot of measures, and splitting the measures by day (timestamp as partition key) might be a bit risky, as we could easily reach the commonly recommended limit of 100 MB per partition. Each measure concerns a specific object identified by an ID. So I would like to use a bucket, but I don't know how to do it.

I'm using Cassandra 3.7

Here is roughly what my table will look like:

CREATE TABLE measures (
  instrument_id bigint,
  day timestamp,
  bucket int,
  measure_timestamp timestamp,
  measure_id uuid,
  measure_info float,
  object_id bigint,
  PRIMARY KEY ((instrument_id, day, bucket), measure_timestamp, measure_id)
);

I thought of adding object_id to the partition key, but then I would lose the "flow of measures" made by an instrument, and what interests me is seeing all the measures made by an instrument on a specific day or over a period of time.

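  • So the question is: when I want to query all the records of one particular instrument for one day, what do I do if there are many buckets?
  • If I want partitions limited to 400,000 rows, how do I know which bucket to insert into when writing data?
  • Is there a way to know the number of buckets?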

Thanks a lot for your help!

Recommended answer

Focus on your requirements first, then go back to your schema. In your case, how many measures per day can each instrument produce? If each one produces fewer than your 400k measures, you are already done without bucketing. If an instrument can produce up to 10M measures, then N = 10M / 400k = 25 buckets should be enough to satisfy your requirements.

With N buckets, querying all the measures coming from a particular instrument takes N queries, one per bucket, unless you can count the measures as you write them and switch buckets whenever one is full. That is, you write the first 400k measures to bucket 0, the next 400k measures to bucket 1, and so on. You then need to keep track of the number K of buckets you have actually written to, and run only K queries instead of N. That way you have unbalanced buckets (and partitions), but you get your results in the smallest number of queries.

If you prefer balanced buckets, write each measure to a uniformly distributed random bucket number instead, but then you have to run all N queries to get all the data for a specific instrument.
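The bucket arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the per-instrument, per-day write counter is assumed to be tracked by the application (how and where it is stored is not covered here), and the CQL string is shown for context only and is not executed.

```python
import math
import random

ROWS_PER_BUCKET = 400_000  # target partition size taken from the question

# One query of this shape is needed per bucket (shown for context, not executed here).
SELECT_ONE_BUCKET = (
    "SELECT * FROM measures "
    "WHERE instrument_id = ? AND day = ? AND bucket = ?"
)


def bucket_count(max_measures_per_day, rows_per_bucket=ROWS_PER_BUCKET):
    """N: buckets needed to hold the worst-case number of measures per day."""
    return max(1, math.ceil(max_measures_per_day / rows_per_bucket))


def sequential_bucket(measures_written_so_far, rows_per_bucket=ROWS_PER_BUCKET):
    """Bucket for the next write when buckets are filled one after another.

    measures_written_so_far is the application-maintained counter for this
    instrument and day (hypothetical; it must be tracked somewhere).
    """
    return measures_written_so_far // rows_per_bucket


def buckets_to_read(measures_written, rows_per_bucket=ROWS_PER_BUCKET):
    """The K buckets that actually contain data under sequential filling."""
    if measures_written == 0:
        return range(0)
    last_bucket = (measures_written - 1) // rows_per_bucket
    return range(last_bucket + 1)


def random_bucket(n_buckets):
    """Balanced alternative: pick a uniformly distributed bucket per write."""
    return random.randrange(n_buckets)
```

With sequential filling, a read for one instrument and day would bind each value in buckets_to_read(counter) to the bucket column in turn. With random_bucket, every value in range(bucket_count(...)) has to be queried, since any bucket may hold data.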
