Is it a bad practice to have a Cassandra table with partitions of a single row?


Question

Suppose I have a table like this:

CREATE TABLE request(
  transaction_id text,
  request_date timestamp,
  data text, 
  PRIMARY KEY (transaction_id)
);

The transaction_id is unique, so as far as I understand each partition in this table would have one row only. I'm not sure whether this situation causes a performance issue at the OS level, perhaps because Cassandra creates a file for each partition, leaving the hosting OS with a huge number of files to manage. As a note, I'm not sure how Cassandra actually lays out its files for a table.

In this scenario I can find a request by its transaction_id like:

SELECT data FROM request WHERE transaction_id = 'abc';

If the previous assumption is correct, could a different approach be the following one?

CREATE TABLE request( 
  the_date date, 
  transaction_id text, 
  request_date timestamp, 
  data text, 
  PRIMARY KEY ((the_date), transaction_id)
);

The field the_date would change every day, so a new partition in the table would be created for each day.

In this scenario I would have to keep the_date available on the client side at all times, so that I can find a request with the following query:

SELECT data FROM request WHERE the_date = '2020-09-23' AND transaction_id = 'abc';

Thanks in advance for your help!

Answer

Cassandra doesn't create a separate file for each partition. One SSTable file may contain multiple partitions. Partitions that consist of only one row are often called "skinny rows" - they aren't bad per se, but they may cause some performance issues:


  • To access such a partition you still need to read a block of compressed data (64Kb by default) and decompress it to get at that data. If you're doing truly random access, such blocks get evicted from the file cache and must be re-read from disk. In this case it may be useful to decrease the block size (see the sketch after this list).

  • If you have many such partitions per table per node, this can significantly increase the size of the bloom filter, because it holds a separate entry for each partition. I've seen customers allocate tens of gigabytes of memory for bloom filters just because of skinny partitions.
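
As a rough sketch of the chunk-size tuning mentioned in the first bullet (the 16Kb value is only an illustrative assumption, and the option names apply to recent Cassandra versions):

ALTER TABLE request
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16};  -- smaller chunks so a random read decompresses less data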

So it really depends on the amount of data, access patterns, etc. Whether this design is good or bad depends on those factors.

If you have the date available and want to use it as part of the partition key, that may also not be advisable: if you're writing and reading a lot of data on a given day, only a few nodes will handle that load - these are so-called "hot partitions".

You may implement so-called bucketing, where you derive the partition key from the data. How to do that depends on the data available. For example, if you have the date plus a transaction ID as a string, you could build the partition key from the date plus the first character of that string - in that case you would have N partition keys per day, distributed between nodes, eliminating the hot-partition problem.
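
A minimal sketch of what such a bucketed table could look like (the request_bucketed name and the first-character bucketing rule are only illustrative assumptions; the client must derive the same bucket value on both writes and reads):

CREATE TABLE request_bucketed (
  the_date date,
  bucket text,              -- e.g. first character of transaction_id, derived client-side
  transaction_id text,
  request_date timestamp,
  data text,
  PRIMARY KEY ((the_date, bucket), transaction_id)
);

-- Read path: the client recomputes the bucket from the transaction_id
SELECT data FROM request_bucketed
WHERE the_date = '2020-09-23' AND bucket = 'a' AND transaction_id = 'abc';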

See the DataStax best practices documentation on this topic.
