Is it a bad practice to have a Cassandra table with partitions of a single row?

Views: 64 | Posted: 2020/9/29 20:50:44 | Tags: cassandra, primary-key, partition


Question

Suppose I have a table like this:

CREATE TABLE request(
  transaction_id text,
  request_date timestamp,
  data text, 
  PRIMARY KEY (transaction_id)
);

The transaction_id is unique, so as far as I understand each partition in this table would have only one row. I'm not sure whether this causes a performance issue at the OS level, for example if Cassandra created a file for each partition and left the host OS with a huge number of files to manage. Note that I'm not sure how Cassandra creates the files for its tables.

In this scenario I can find a request by its transaction_id, like this:

SELECT data FROM request WHERE transaction_id = 'abc';

If the previous assumption is correct, could the following be a different approach?

CREATE TABLE request( 
  the_date date, 
  transaction_id text, 
  request_date timestamp, 
  data text, 
  PRIMARY KEY ((the_date), transaction_id)
);

The field the_date would change every day, so a new partition would be created in the table for each day.

In this scenario the client would always need to have the the_date value available, so that I can find a request with the following query:

SELECT data FROM request WHERE the_date = '2020-09-23' AND transaction_id = 'abc';

Thanks in advance for your help!

Recommended answer

Cassandra doesn't create a separate file for each partition. One SSTable file may contain multiple partitions. Partitions that consist of only one row are often called "skinny rows". They aren't necessarily bad, but they may cause some performance issues:

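To access such a partition you still need to read a whole block of compressed data (64 KB by default) and decompress it to get at that one row. If your access pattern is truly random, such blocks will be evicted from the file cache and will have to be re-read from disk. In that case, it could be useful to decrease the block size.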

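If you have a lot of such partitions per table per node, this can significantly increase the size of the bloom filter, because every partition has a separate entry in it. I have seen customers with tens of gigabytes of memory allocated to bloom filters just because of skinny partitions.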
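To get a feel for the bloom filter point, here is a rough back-of-the-envelope sketch. It uses the textbook formula for an optimally sized bloom filter, which is an assumption for illustration only: Cassandra's real sizing differs in detail, and the false-positive chance of 0.01 and the partition counts are illustrative values, not figures from the answer.

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Textbook bits per key for an optimal bloom filter: -ln(p) / (ln 2)^2."""
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_filter_bytes(partitions: int, fp_chance: float) -> float:
    """Approximate filter size in bytes; every partition key is one entry."""
    return partitions * bloom_bits_per_key(fp_chance) / 8


if __name__ == "__main__":
    # Illustrative partition counts per node (assumed values).
    for n in (10**8, 10**9, 10**10):
        gib = bloom_filter_bytes(n, 0.01) / 2**30
        print(f"{n:>14,} partitions -> ~{gib:.1f} GiB")
```

Under these assumptions the filter costs roughly a gigabyte per billion single-row partitions, so tens of billions of skinny partitions on a node would be in line with the tens of gigabytes mentioned above.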

So it really depends on the amount of data, the access patterns, and so on. It could be good or bad depending on those factors.

If you have the date available and want to use it as part of the partition key, that may also be inadvisable: if you write and read a lot of data for that day, only some nodes will handle that load. These are the so-called "hot partitions".

You may implement so-called bucketing, where you derive the partition key from the data. This depends on the data you have available. For example, if you have the date and the transaction ID as a string, you can build the partition key from the date plus the first character of that string. You will then have N partition keys per day, distributed across the nodes, which eliminates the hot partition problem.
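The bucketing idea could be sketched on the client side as follows. This is a minimal illustration, not the answer author's code: the table name request_bucketed, the bucket column, and the helper functions are hypothetical, and the schema is only held in a string here, not executed.

```python
from datetime import date
from typing import Tuple

# Hypothetical schema: (the_date, bucket) together form the partition key,
# so each day is spread over as many partitions as there are bucket values.
CREATE_TABLE_CQL = """
CREATE TABLE request_bucketed (
  the_date date,
  bucket text,
  transaction_id text,
  request_date timestamp,
  data text,
  PRIMARY KEY ((the_date, bucket), transaction_id)
);
"""


def bucket_for(transaction_id: str) -> str:
    """Derive the bucket from the first character of the transaction ID."""
    if not transaction_id:
        raise ValueError("transaction_id must not be empty")
    return transaction_id[0]


def partition_key(the_date: date, transaction_id: str) -> Tuple[date, str]:
    """Values to bind for the (the_date, bucket) partition key columns."""
    return (the_date, bucket_for(transaction_id))


if __name__ == "__main__":
    print(partition_key(date(2020, 9, 23), "abc"))
```

Because the bucket is computed from the transaction ID itself, a reader who knows the date and the ID can recompute the same partition key without storing anything extra. How evenly the load spreads depends on how the first characters of real IDs are distributed.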

See the DataStax best practices documentation on this topic.
