Is it a bad practice to have a Cassandra table with partitions of a single row?
Question
Let's say I have a table like this:
CREATE TABLE request(
transaction_id text,
request_date timestamp,
data text,
PRIMARY KEY (transaction_id)
);
The transaction_id is unique, so as far as I understand each partition in this table would hold only one row. I'm not sure whether this causes a performance issue in the OS, perhaps because Cassandra creates a file for each partition, leaving the host OS with lots of files to manage. Note that I'm not sure how Cassandra creates the files for its tables.
In this scenario I can find a request by its transaction_id like this:
SELECT data FROM request WHERE transaction_id = 'abc';
If the previous assumption is correct, could the following be a different approach?
CREATE TABLE request(
the_date date,
transaction_id text,
request_date timestamp,
data text,
PRIMARY KEY ((the_date), transaction_id)
);
The field the_date would change every day, so a new partition would be created in the table for each day.
In this scenario I would have to keep the_date always available to the client so that I can find a request using the following query:
SELECT data FROM request WHERE the_date = '2020-09-23' AND transaction_id = 'abc';
Thanks in advance for your help!
Recommended answer
Cassandra doesn't create a separate file for each partition. One SSTable file may contain multiple partitions. Partitions that consist of only one row are often called "skinny rows". They aren't very bad, but they may cause some performance issues:
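- To access such a partition you still need to read a whole block of compressed data (64 KB by default), which has to be decompressed to get at that data. If you're doing truly random access, such blocks will be evicted from the file cache and will need to be re-read from disk. In this case, it could be useful to decrease the block size.
- If you have a lot of such partitions per table per node, this may significantly increase the size of the Bloom filter, because each partition has a separate entry. I have seen some customers with dozens of gigabytes of memory allocated to Bloom filters only because of skinny partitions.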
So it really depends on the amount of data, the access patterns, etc. It could be good or bad depending on those factors.
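For the first point, the compression chunk size is a per-table setting. A minimal sketch, assuming Cassandra 3.0 or later (where the option is named chunk_length_in_kb) and the request table from the question. The value 16 is only an illustrative choice, not a recommendation:

```cql
-- Assumption: Cassandra 3.0+ option names; LZ4Compressor is the usual default compressor.
-- chunk_length_in_kb must be a power of two; 16 is an illustrative value only.
ALTER TABLE request
WITH compression = {
    'class': 'LZ4Compressor',
    'chunk_length_in_kb': 16
};
```

Only SSTables written after the change use the new chunk size. Existing ones keep the old size until they are rewritten.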
If you have the date available and want to use it as part of the partition key, that may also not be advisable: if you're writing and reading a lot of data on a given day, then only some nodes will handle that load. These are so-called "hot partitions".
You may implement so-called bucketing, where you infer the partition key from the data. This will depend on the data available, though. For example, if you have the date and the transaction ID as a string, you may build the partition key from the date plus the first character of that string. In this case you'll have N partition keys per day that are distributed between nodes, eliminating the hot partition problem.
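A minimal sketch of that bucketing scheme in CQL, assuming the schema from the question. The table name request_bucketed and the bucket column are hypothetical names introduced for illustration, and the client is assumed to compute bucket as the first character of transaction_id before every write and read:

```cql
-- Hypothetical bucketed variant of the question's second table.
-- Assumption: the client sets bucket to the first character of transaction_id.
CREATE TABLE request_bucketed (
    the_date date,
    bucket text,
    transaction_id text,
    request_date timestamp,
    data text,
    PRIMARY KEY ((the_date, bucket), transaction_id)
);

-- Lookup for transaction 'abc': its bucket is 'a', so the full partition key is known.
SELECT data FROM request_bucketed
WHERE the_date = '2020-09-23' AND bucket = 'a' AND transaction_id = 'abc';
```

How evenly this spreads the load depends on how the first characters of the transaction IDs are distributed, so it is worth checking that against real data.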