Aggregation in Cassandra across partitions

Question

I have a data model like the one below:

CREATE TABLE appstat.nodedata (
    nodeip text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

Here nodeip is the partition key and timestamp is the clustering key (sorted in descending order so that the latest rows come first).

Sample data from this table:

SELECT * from nodedata WHERE nodeip = '172.30.56.60' LIMIT 2;

 nodeip       | timestamp                       | flashmode | physicalusage | readbw | readiops | totalcapacity | writebw | writeiops | writelatency
--------------+---------------------------------+-----------+---------------+--------+----------+---------------+---------+-----------+--------------
 172.30.56.60 | 2017-12-08 06:13:07.161000+0000 |       yes |            34 |     57 |       19 |            27 |       8 |        89 |           57
 172.30.56.60 | 2017-12-08 06:12:07.161000+0000 |       yes |            70 |      6 |       43 |            88 |      79 |        83 |           89

This works as expected: whenever I need statistics, I can fetch the data using the partition key, like this:

SELECT nodeip,readbw,timestamp FROM nodedata WHERE nodeip = '172.30.56.60' AND timestamp < 1512652272989 AND timestamp > 1512537899000;

I can also aggregate the data successfully, like this:

SELECT sum(readbw) FROM nodedata WHERE nodeip = '172.30.56.60' AND timestamp < 1512652272989 AND timestamp > 1512537899000;

Now for the next use case: I need to get cluster-wide data (all data for the four nodes), like below:

SELECT nodeip,readbw,timestamp FROM nodedata WHERE nodeip IN ('172.30.56.60','172.30.56.61','172.30.56.62','172.30.56.63') AND timestamp < 1512652272989 AND timestamp > 1512537899000;

However, a number of sites state clearly that an IN query on the partition key causes performance problems. What would you suggest for the data model of the nodedata table above? (Note: running multiple queries against different partitions is okay, though I see it as a last resort.)

Is there a better approach, a better way to redesign this data model, or a better solution for retrieving data from multiple partitions?

Any help would be much appreciated.

Thanks,
Harry

Recommended answer

Yes, using IN on the partition key is discouraged because it puts more load on the coordinator node, especially when many partitions are listed in the IN clause. Issuing a separate request per partition, asynchronously for example, can even perform better and puts less load on the coordinator nodes.
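The per-partition approach could be sketched in Python roughly as follows. This is a minimal sketch, not the answerer's code: it assumes the DataStax Python driver (cassandra-driver) and a session already connected to the appstat keyspace, and the function name cluster_readbw_sum is made up for illustration. One prepared query is sent per nodeip with execute_async, and the partial sums are added on the client.

```python
# Assumption: `session` is a cassandra-driver Session connected to the
# `appstat` keyspace. The function name is hypothetical.

SUM_QUERY = (
    "SELECT sum(readbw) AS total FROM nodedata "
    "WHERE nodeip = ? AND timestamp > ? AND timestamp < ?"
)


def cluster_readbw_sum(session, node_ips, start, end):
    """Sum readbw across partitions using one async query per partition."""
    prepared = session.prepare(SUM_QUERY)

    # Send every request before waiting on any of them, so the
    # single-partition queries run concurrently.
    futures = [
        session.execute_async(prepared, (ip, start, end))
        for ip in node_ips
    ]

    total = 0
    for future in futures:
        row = future.result().one()  # blocks until this query finishes
        if row is not None and row.total is not None:
            total += row.total
    return total
```

With a real cluster, the session would typically come from something like Cluster(["172.30.56.60"]).connect("appstat"), and start and end would be passed as datetime objects. Each request touches a single partition, so the work is spread over several coordinators instead of being fanned out by one.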

You also need to take the size of your partitions into account. From a quick look at the schema, each partition will grow to about 55 MB in one year if you sample every minute. Partitions that are too wide can also lead to performance problems (although not always; it depends on the use case). You may need to change the partition key to include the year, or year plus month, to get smaller partitions. In that case, though, your code needs some additional logic when it retrieves data that spans several years or months.
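The additional logic for a bucketed partition key might look like the sketch below. It assumes a hypothetical redesign in which the primary key becomes ((nodeip, month), timestamp) with month stored as a 'YYYY-MM' string; neither that column nor the helper name month_buckets appears in the original schema. The helper lists every month bucket a time range touches, so the client knows which partitions to query.

```python
from datetime import datetime


def month_buckets(start, end):
    """Return every 'YYYY-MM' bucket touched by the range start..end."""
    if end < start:
        raise ValueError("end must not be earlier than start")

    buckets = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


if __name__ == "__main__":
    # Hypothetical range spanning a year boundary.
    print(month_buckets(datetime(2017, 11, 20), datetime(2018, 1, 5)))
```

Each (nodeip, bucket) pair then becomes one single-partition query, which combines naturally with the asynchronous per-partition requests suggested in the answer.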

P.S. This may not fully answer your question, but the comment field is too small for it :-)
