How to model Cassandra DB for Time Series, server metrics


Question

My name is Daniel. I'm a newcomer account-wise but a long-time lurker. I decided to learn Apache Cassandra for my next "let's write some code while the kids are sleeping" project.

What I'm writing is a neat little API that will do reads and writes against a Cassandra database. I had a lot of the DB layout figured out in MongoDB, but for me it's time to move on and grow as an engineer :)

Mission: I will collect metrics from the servers in my rack; an agent will send a payload of metrics every minute. I have the API part pretty much figured out and will use JWT tokens to sign the payloads. The types of data I will store are cpuload, cpuusage, memusage, diskusage, etc.

The part where I am confused with Cassandra is how to write the actual model. As I understand it, the storage engine sort of writes it all as a time series on disk for me, making reads quite amazing. I know anything I would whip together now would work for my lab since it's just 30 machines, but I'm trying to understand how these things are done properly and how it could be done for a real-life scenario like Server Density, Datadog, "insert your preferred server monitoring service". :)

But how do you more experienced engineers design a schema like this?


Usage scenarios for the database:


  • Write payloads every minute through the API. (Let's imagine that's at least 100k writes per minute for the sake of learning something useful.)
  • Read the assets associated with one's userid


  • Pull most recent data (3h)
  • Pull most recent data (daily)
  • Pull most recent data (weekly)
  • Pull most recent data (monthly)
  • etc.

Generate monthly PDF reports showing uptime and such.

Should I insert rows containing the full payload, or am I better off inserting them on a per-service basis: timeuid|cpuusage

Per service row

CREATE TABLE metrics (
    id uuid PRIMARY KEY,
    assetid int,
    serviceType text,
    metricValue int
);

Many in one

CREATE TABLE metrics (
    id uuid PRIMARY KEY,
    assetid int,
    cpuload int,
    cpuusage int,
    memusage int,
    diskusage int
);

In Mongo I would preallocate the buckets and also keep a quick-read average inside the document, so in the web GUI I could simply show the average stats for predefined time periods.

Examples for dumbasses are highly appreciated. Hope you can decipher my rather poor English.

Just found this URL in the SO suggestions: "Cassandra data model for time series". I guess that is something that applies to me as well.

Sincerely,
Daniel Olsson

Recommended answer

For your data model, I would suggest adding time as a clustering column:

CREATE TABLE metrics (
    id uuid,
    time timeuuid,
    assetid int,
    cpuload int,
    cpuusage int,
    memusage int,
    diskusage int,
    PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);

Use descending order to keep the latest metrics first. With one row per minute, you can then query using the LIMIT clause to get the most recent hour:

SELECT * FROM metrics WHERE id = <UUID> LIMIT 60

Or the most recent day:

SELECT * FROM metrics WHERE id = <UUID> LIMIT 1440
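
The LIMIT values above follow from one row per minute per id. As a rough sketch of that arithmetic for the other windows in the question, assuming exactly one payload arrives per minute (the names SAMPLE_INTERVAL, rows_for and limits are made up for illustration):

```python
from datetime import timedelta

# Assumption: the agent delivers exactly one payload (one row) per minute.
SAMPLE_INTERVAL = timedelta(minutes=1)


def rows_for(window: timedelta, interval: timedelta = SAMPLE_INTERVAL) -> int:
    """Number of rows that cover `window` at one row per `interval`."""
    return window // interval


# LIMIT values for the time windows mentioned in the question.
limits = {
    "1h": rows_for(timedelta(hours=1)),
    "3h": rows_for(timedelta(hours=3)),
    "daily": rows_for(timedelta(days=1)),
    "weekly": rows_for(timedelta(days=7)),
}
```

Note that LIMIT counts rows, not time: if an agent skips a few payloads, the same LIMIT reaches further back than intended. A range predicate on the time clustering column (for a timeuuid, for example via CQL's minTimeuuid function) may be the safer choice when the window has to be exact.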

Depending on how long you plan to keep the data, you may want to add a year, month, or day column to the table to limit your partition size. For example, if you wish to keep data for 3 months, a month column can be added so that rows are partitioned by id and month:

CREATE TABLE metrics (
    id uuid,
    time timeuuid,
    month text,
    assetid int,
    cpuload int,
    cpuusage int,
    memusage int,
    diskusage int,
    PRIMARY KEY ((id, month), time)
) WITH CLUSTERING ORDER BY (time DESC);

If you keep data for several years, use year + month or a date value.
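
With (id, month) as the partition key, every read has to supply both values, so the application needs to work out which month buckets a time window touches. A minimal sketch of that step, assuming the month column stores a "YYYY-MM" string computed in UTC (that format, and the helper names below, are assumptions rather than anything the schema prescribes):

```python
from datetime import datetime, timezone

# Upper bound on rows in one (id, month) partition at one row per minute.
MAX_ROWS_PER_PARTITION = 31 * 24 * 60


def month_bucket(ts: datetime) -> str:
    """Value for the `month` column; expects a timezone-aware datetime."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m")


def buckets_between(start: datetime, end: datetime) -> list:
    """Month buckets touched by [start, end], newest first (matches DESC order)."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    year, month = start.year, start.month
    buckets = []
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        # Step to the next calendar month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets[::-1]
```

A weekly read that straddles a month boundary would then issue one query per returned bucket (WHERE id = ? AND month = ?) and concatenate the results in order.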

Regarding your final question, about separate tables or a single table: Cassandra supports sparse columns, so you can make multiple inserts in a common table for each metric without updating any data. However, it's always faster to write just once per row.
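
To illustrate the "write once per row" option, here is a hedged sketch that turns one agent payload into the bind parameters for a single INSERT against the month-partitioned table. The payload shape, the row_params helper and the "YYYY-MM" month format are assumptions; the statement uses CQL's now() so the server generates the timeuuid, and actually executing it would require a Cassandra driver, which is not shown.

```python
import uuid
from datetime import datetime, timezone

# One statement per payload; `?` placeholders are for a prepared statement.
# now() asks Cassandra to generate the timeuuid for the `time` column.
INSERT_CQL = (
    "INSERT INTO metrics "
    "(id, month, time, assetid, cpuload, cpuusage, memusage, diskusage) "
    "VALUES (?, ?, now(), ?, ?, ?, ?, ?)"
)


def row_params(server_id: uuid.UUID, asset_id: int, payload: dict, received_at: datetime) -> tuple:
    """Bind parameters for INSERT_CQL, in placeholder order.

    Assumes `payload` carries all four metrics as ints and `received_at`
    is timezone-aware; a missing metric raises KeyError.
    """
    month = received_at.astimezone(timezone.utc).strftime("%Y-%m")
    return (
        server_id,
        month,
        asset_id,
        payload["cpuload"],
        payload["cpuusage"],
        payload["memusage"],
        payload["diskusage"],
    )
```

The per-metric alternative would issue four such statements for the same primary key, each setting a single metric column, which is what the sparse-column remark above refers to.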

You may need separate tables if you have to query different metrics by an alternative key, for example disk usage by id and disk name. You'd need a separate table or a materialized view to support that query pattern.

Finally, your schema defines an assetid, but it isn't part of your primary key, so with your current schema you can't query by assetid.
