Aggregation in Cassandra across partitions

Question

I have a data model like the one below:

CREATE TABLE appstat.nodedata (
    nodeip text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

where nodeip is the partition key and timestamp is the clustering key (sorted in descending order so the latest rows come first).

Sample data in this table:

SELECT * from nodedata WHERE nodeip = '172.30.56.60' LIMIT 2;

 nodeip       | timestamp                       | flashmode | physicalusage | readbw | readiops | totalcapacity | writebw | writeiops | writelatency
--------------+---------------------------------+-----------+---------------+--------+----------+---------------+---------+-----------+--------------
 172.30.56.60 | 2017-12-08 06:13:07.161000+0000 |       yes |            34 |     57 |       19 |            27 |       8 |        89 |           57
 172.30.56.60 | 2017-12-08 06:12:07.161000+0000 |       yes |            70 |      6 |       43 |            88 |      79 |        83 |           89

This works properly, and whenever I need the statistics I can get the data using the partition key, like below:

SELECT nodeip,readbw,timestamp FROM nodedata WHERE nodeip = '172.30.56.60' AND timestamp < 1512652272989 AND timestamp > 1512537899000;

I can also successfully aggregate the data, like below:

SELECT sum(readbw) FROM nodedata WHERE nodeip = '172.30.56.60' AND timestamp < 1512652272989 AND timestamp > 1512537899000;

Now comes the next use case, where I need to fetch the data for the whole cluster (the data of all four nodes), like below:

SELECT nodeip,readbw,timestamp FROM nodedata WHERE nodeip IN ('172.30.56.60','172.30.56.61','172.30.56.62','172.30.56.63') AND timestamp < 1512652272989 AND timestamp > 1512537899000;

But a number of sites clearly mention that an IN query has a lot of performance problems, so what would you suggest for the data model of the 'nodedata' table above? (NOTE: running multiple queries against different partitions is okay, but I see it as a last resort.)

Is there a better approach, a redesign of this model, or a better solution for retrieving data from multiple partitions?

Any help would be much appreciated.

Thanks,
Harry

Recommended answer

Yes, using IN on the partition key is discouraged because it puts more load on the coordinator node, especially when many partitions are listed in the IN clause. Multiple separate requests issued asynchronously, for example, can even perform better and put less load on the coordinator nodes.
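
As a rough illustration of the "multiple separate async requests" idea, here is a minimal Python sketch. It assumes `session` is a connected `Session` from the DataStax Python driver (`cassandra-driver`), and the helper name `cluster_sum` is made up for this example. Each node gets its own single-partition aggregate query, all queries are in flight at once, and the partial sums are added up on the client.

```python
def cluster_sum(session, node_ips, start_ms, end_ms):
    """Sum readbw over several partitions using one async query per partition.

    Assumption: `session` behaves like cassandra.cluster.Session, i.e.
    execute_async() returns a future whose result() yields a result set.
    """
    query = (
        "SELECT sum(readbw) FROM appstat.nodedata "
        "WHERE nodeip = %s AND timestamp > %s AND timestamp < %s"
    )
    # Fire all single-partition queries first so they run concurrently.
    futures = [
        session.execute_async(query, (ip, start_ms, end_ms))
        for ip in node_ips
    ]
    total = 0
    for future in futures:
        row = future.result().one()  # blocks until this query has finished
        if row is not None and row[0] is not None:
            total += row[0]
    return total
```

With a real cluster this might be called as `cluster_sum(session, ['172.30.56.60', '172.30.56.61', '172.30.56.62', '172.30.56.63'], 1512537899000, 1512652272989)`. Because each request targets exactly one partition, a token-aware driver can route it straight to a replica instead of making a single coordinator gather everything.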

Also, you need to take the size of your partitions into account. From a quick look at the schema, I see that every partition will grow to ~55 MB in one year if you sample every minute. Partitions that are too wide can also lead to performance problems (although not always; it depends on the use case). You may need to change the partition key to include the year, or year+month, to make smaller partitions. In that case, though, your code needs some additional logic when it retrieves data that spans several years/months.
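
The "additional logic" for a bucketed partition key could look roughly like the sketch below. It assumes a hypothetical redesign in which the partition key becomes `(nodeip, month)`, with `month` stored as a `'YYYY-MM'` string; that column and the helper names are illustrative and not part of the original schema. Given a time range in epoch milliseconds, the code lists every month bucket the range touches, and from that every partition that has to be queried.

```python
from datetime import datetime, timezone


def month_buckets(start_ms, end_ms):
    """Return the 'YYYY-MM' buckets touched by the range [start_ms, end_ms] (UTC)."""
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    end = datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc)
    buckets = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


def partitions_to_query(node_ips, start_ms, end_ms):
    """Return every (nodeip, month) partition key pair covering the range."""
    buckets = month_buckets(start_ms, end_ms)
    return [(ip, bucket) for ip in node_ips for bucket in buckets]
```

For the time range used in the question, `month_buckets(1512537899000, 1512652272989)` gives a single bucket, so four nodes still mean only four single-partition queries. A range spanning several months multiplies the number of queries accordingly, which is the trade-off for keeping each partition small.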

P.S. This may not fully answer your question, but the comment field is too small for it :-)
