Why Cassandra COUNT(*) on a specific partition takes really long on relatively small datasets


Problem description


I have a table defined like:

Keyspace:

CREATE KEYSPACE messages WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

Table:

CREATE TABLE messages.textmessages (
    categoryid int,
    date timestamp,
    messageid timeuuid,
    message text,
    userid int,
    PRIMARY KEY ((categoryid, date), messageid)
) WITH CLUSTERING ORDER BY (messageid ASC);


The goal is to have a wide-row time-series store such that categoryid and date (beginning of day) constitute my partition key and messageid provides the clustering. This enables me to run queries like:

SELECT * FROM messages.textmessages WHERE categoryid=2 AND date='2019-05-14 00:00:00.000+0300' AND messageId > maxTimeuuid('2019-05-14 00:00:00.000+0300') AND messageId < minTimeuuid('2019-05-15 00:00:00.000+0300')


to get the messages for a given day; it works well and it's fast!

Question


I need to be able to count the messages in a given day by substituting SELECT COUNT(*) for the SELECT * above. This takes very long even with a little under 100K entries in the column family; it actually times out in cqlsh.



Why would this query take so long even when:

SELECT COUNT(*) FROM messages.textmessages WHERE categoryid=2 AND date='2019-05-14 00:00:00.000+0300' AND messageId > maxTimeuuid('2019-05-14 00:00:00.000+0300') AND messageId < minTimeuuid('2019-05-15 00:00:00.000+0300')

  1. The count is over a specific partition with fewer than 100K records
  2. I am running only a single Cassandra node, on a high-performance MacBook Pro
  3. There are no active writes/reads on the instance, and there are fewer than 20 partitions on the dev laptop

Recommended answer

This is understandably caused by a common pitfall: overlooking Cassandra's "everything-is-a-write" model, which is why tombstones happen.


When executing a scan, within or across a partition, we need to keep the tombstones seen in memory so we can return them to the coordinator, which will use them to make sure other replicas also know about the deleted rows. With workloads that generate a lot of tombstones, this can cause performance problems and even exhaust the server heap.


Thanks to @JimWartnick's suggestion about possible tombstone-related latency: this was caused by the overwhelming number of tombstones generated by my inserts, which had NULL fields. I did not expect those to create tombstones, nor did I expect tombstones to matter so much for query performance, especially for COUNT.
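One way to confirm that tombstones are the culprit (assuming cqlsh access to the node; the counts shown in the comment are illustrative, not from the original post) is to enable request tracing and rerun the count:

```sql
cqlsh> TRACING ON;
cqlsh> SELECT COUNT(*) FROM messages.textmessages
   ... WHERE categoryid=2 AND date='2019-05-14 00:00:00.000+0300';

-- The trace output then includes read lines such as:
--   Read 98000 live rows and 196000 tombstone cells
```

A high ratio of tombstone cells to live rows in the trace, or in nodetool tablestats, points to tombstone scanning rather than raw data volume as the source of the latency.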

Solution

  1. Use the default unset value for absent fields, or omit them from inserts/updates entirely
  2. Be aware of the following fact, from Common Problems with Cassandra Tombstones by Alla Babkina:


One common misconception is that tombstones only appear when the client issues DELETE statements to Cassandra. Some developers assume it is safe to choose a mode of operation that relies on Cassandra being completely tombstone free. In reality there are many other things that cause tombstones apart from issuing DELETE statements: inserting null values, inserting collections, and expiring data with TTL are all common sources of tombstones.
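As a minimal sketch against the textmessages table above (the timestamp values are illustrative): writing an explicit null creates a cell tombstone, while simply omitting the column from the insert does not:

```sql
-- Writing an explicit null creates a tombstone for the userid cell:
INSERT INTO messages.textmessages (categoryid, date, messageid, message, userid)
VALUES (2, '2019-05-14 00:00:00.000+0300', now(), 'hello', null);

-- Omitting the column writes nothing for userid, so no tombstone is created:
INSERT INTO messages.textmessages (categoryid, date, messageid, message)
VALUES (2, '2019-05-14 00:00:00.000+0300', now(), 'hello');
```

Prepared statements behave the same way: binding null produces a tombstone, whereas leaving a parameter unset (where the driver supports unset values) leaves the cell untouched.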
