Cassandra - Handling partition and bucket for large data size

Question

We have a requirement where an application reads a file and inserts the data into a Cassandra database; however, the table can grow by 300+ MB in one shot during the day. The table will have the structure below:

create table if not exists orders (
    id uuid,
    record text,
    status varchar,
    create_date timestamp,
    modified_date timestamp,
    primary key (status, create_date)
);

The 'status' column can take the values [Started, Completed, Done]. According to a couple of documents on the internet, read performance is best when a partition is under 100 MB, and an index should be used on a column that is rarely modified (so I cannot use the 'status' column as an index). Also, if I use buckets with TWCS at minute granularity, there will be lots of buckets, which may hurt performance.

So, how can I make better use of partitions and/or buckets to insert evenly across partitions and to read records with the appropriate status?

Thanks.

Recommended answer

From the discussion in the comments, it looks like you are trying to use Cassandra as a queue, and that is a big anti-pattern.
While you could store data about the operations you've done in Cassandra, you should look at something like Kafka or RabbitMQ for the queuing.

It could look something like this:

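  1. Application 1 copies/generates record A;
  2. Application 1 adds the path of A to the queue;
  3. Application 1 writes a row to Cassandra in a partition based on the file ID/path (other columns can hold information such as the date, copy time, file hash, etc.);
  4. Application 2 reads the queue, finds A, processes it, and determines whether it failed or completed;
  5. Application 2 updates Cassandra with information about the processing, including the status. You can also store things like the reason for a failure;
  6. If it failed, the path/ID can be written to another topic.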
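
The Cassandra side of steps 3 and 5 could be sketched roughly as follows. This is only an illustration: the table name `file_processing_log`, its columns, the sample UUID, and the path and hash values are all hypothetical placeholders, not something taken from the question.

```cql
-- Hypothetical log table: one partition per file, so writes spread
-- evenly across the cluster and each partition stays tiny.
create table if not exists file_processing_log (
    file_id uuid,
    file_path text,
    file_hash text,
    copied_at timestamp,
    status text,
    failure_reason text,
    processed_at timestamp,
    primary key (file_id)
);

-- Step 3: Application 1 records the file after adding its path to the queue.
insert into file_processing_log (file_id, file_path, file_hash, copied_at, status)
values (123e4567-e89b-12d3-a456-426614174000, '/data/in/A.csv',
        'placeholder-hash', toTimestamp(now()), 'Started');

-- Step 5: Application 2 records the outcome of processing the same file.
update file_processing_log
set status = 'Done', processed_at = toTimestamp(now())
where file_id = 123e4567-e89b-12d3-a456-426614174000;
```

In this sketch the queue, not Cassandra, decides what gets processed next, so nothing ever has to scan the table by status; every read and write addresses a single partition by its `file_id`.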

So to sum it up, don't try to use Cassandra as a queue; that is a widely accepted anti-pattern. You can and should use Cassandra to persist a log of what you have done, possibly including the results of the processing (if applicable), how files were processed, and so on.
Depending on how you later need to read and use the data in Cassandra, you could think about partitions and buckets based on things like the source of the file, the type of file, etc. If not, you could keep it partitioned by a unique value like the UUID I've seen in your table. Then you could look up info about a file based on that.
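
If the data does need to be read back by file source, one possible shape for such a bucketed table is sketched below. The table name, the column names, the one-day bucket size, and the sample values `'sftp-eu'` and `'2024-01-15'` are assumptions made for illustration; the right bucket size depends on the actual data volume per source.

```cql
-- Hypothetical table bucketed by (source, day): each source gets one
-- partition per day, which bounds partition size without creating the
-- very large number of buckets that minute-level bucketing would.
create table if not exists file_log_by_source (
    source text,
    day date,
    created_at timestamp,
    file_id uuid,
    status text,
    primary key ((source, day), created_at, file_id)
) with clustering order by (created_at desc, file_id asc);

-- Read one bucket: all files from one source on one day, newest first.
select file_id, status, created_at
from file_log_by_source
where source = 'sftp-eu' and day = '2024-01-15';
```

By contrast, the key in the question, `primary key (status, create_date)`, makes `status` the partition key, so all rows land in at most three partitions, and two rows with the same status and timestamp overwrite each other. In the sketch above, `status` is a regular column, so filtering on it would happen on the client after reading a bucket.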

Hope this helped,
Cheers!
