Storing time-series data, relational or non?

Problem description

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.

I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:

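Is a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis) better in terms of performance when querying data for graphing?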

Given a relational database, I would use a data_instances table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:

Fields: id, fk_to_device, fk_to_metric, metric_value, timestamp

When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:

SELECT metric_value, timestamp FROM data_instances
    WHERE fk_to_device=1 AND fk_to_metric=2

The number of rows in this table would be:

d * m_d * f * t

where d is the number of devices, m_d is the accumulative number of metrics being recorded for all devices, f is the frequency at which data is polled for and t is the total amount of time the system has been collecting data.

For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have roughly 3.15 million records (3 × 10 × 105,120 polls), assuming 10 metrics per device.
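
The estimate can be checked with a few lines of arithmetic. This is a rough sketch that assumes every device reports the same number of metrics and that no polls are missed; the variable names are illustrative only.

```python
# Rough row-count estimate for one year of polling.
# Assumptions: each device reports the same number of metrics,
# and no poll is ever missed.
devices = 3
metrics_per_device = 10
polls_per_hour = 60 // 5          # one poll every 5 minutes
hours_per_year = 24 * 365

rows_per_year = devices * metrics_per_device * polls_per_hour * hours_per_year
print(f"{rows_per_year:,} rows per year")   # 3,153,600 rows per year
```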

Without indexes on fk_to_device and fk_to_metric, scanning this continuously expanding table would take too much time. So indexing the aforementioned fields, and also timestamp (for creating graphs with localised periods), is a requirement.
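
The indexing requirement can be tried out on a small scale. The sketch below uses SQLite through Python's standard sqlite3 module purely for convenience (the question names MySQL and PostgreSQL); the index name, the sample data, and the choice to store timestamp as Unix epoch seconds are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data_instances (
        id            INTEGER PRIMARY KEY,
        fk_to_device  INTEGER NOT NULL,
        fk_to_metric  INTEGER NOT NULL,
        metric_value  REAL    NOT NULL,
        timestamp     INTEGER NOT NULL   -- Unix epoch seconds (assumption)
    )
    """
)
# One composite index serves both equality filters and the time range.
conn.execute(
    "CREATE INDEX idx_device_metric_time "
    "ON data_instances (fk_to_device, fk_to_metric, timestamp)"
)

# Hypothetical sample data: 3 devices x 10 metrics x 288 polls (one day).
rows = [
    (device, metric, 0.0, poll * 300)
    for device in range(1, 4)
    for metric in range(1, 11)
    for poll in range(288)
]
conn.executemany(
    "INSERT INTO data_instances "
    "(fk_to_device, fk_to_metric, metric_value, timestamp) VALUES (?, ?, ?, ?)",
    rows,
)

query = (
    "SELECT metric_value, timestamp FROM data_instances "
    "WHERE fk_to_device = ? AND fk_to_metric = ? "
    "AND timestamp BETWEEN ? AND ? ORDER BY timestamp"
)
params = (1, 2, 0, 3600)          # first hour of data for device 1, metric 2
points = conn.execute(query, params).fetchall()

# EXPLAIN QUERY PLAN reports whether SQLite searches via the index.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(len(points), plan_text)
```

Putting the two equality columns first and the time column last lets a single index handle the device and metric filter as well as the graph's time window.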

MongoDB has the concept of a collection; unlike tables, collections can be created programmatically without setup. With these I could partition the storage of data for each device, or even for each metric recorded for each device.

I have no experience with NoSQL and do not know whether such databases provide any query performance enhancing features such as indexing. However, the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.

Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection-based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?

Recommended answer

Definitely Relational. Unlimited flexibility and expansion.

Two corrections, both in concept and application, followed by an elevation.

  1. It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).

  2. Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:

   (Device, Metric, DateTime)

    • The Id column does nothing, it is totally and completely redundant.

      • An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
      • The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.

      You can get rid of it. Please.
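
A minimal sketch of the table without the Id column, again using SQLite via Python's sqlite3 module as a stand-in for whichever relational platform is chosen. The column names, the WITHOUT ROWID clause (SQLite-specific), and the epoch-seconds representation of the date-time are assumptions for illustration, not the answer's actual DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data_instances (
        device        INTEGER NOT NULL,
        metric        INTEGER NOT NULL,
        date_time     INTEGER NOT NULL,  -- Unix epoch seconds (assumption)
        metric_value  REAL    NOT NULL,
        PRIMARY KEY (device, metric, date_time)
    ) WITHOUT ROWID
    """
)
conn.execute("INSERT INTO data_instances VALUES (1, 2, 300, 41.5)")

# A second reading for the same device, metric and instant violates the
# composite primary key, so the duplicate row is rejected by the database.
try:
    conn.execute("INSERT INTO data_instances VALUES (1, 2, 300, 99.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM data_instances").fetchone()[0]
print(duplicate_rejected, row_count)   # True 1
```

Here the primary key itself provides the only index, and it both enforces row uniqueness and supports the per-device, per-metric lookups used for graphing.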

      Elevation

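      1. Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer, starting from the heading "What is Sixth Normal Form?".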

      • (I have one index only, not three; on the Non-SQLs you may need three indices).

      I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.

      (Server, Device, Metric, DateTime)

      The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.
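
The answer does not show its pivot SQL. One common way to pivot such a table with a single SELECT is conditional aggregation; the sketch below illustrates that technique under assumed column names and made-up sample values, and is not necessarily the method the author uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE data_instances (
        device TEXT NOT NULL, metric TEXT NOT NULL, date_time INTEGER NOT NULL,
        metric_value REAL NOT NULL,
        PRIMARY KEY (device, metric, date_time)
    )
    """
)
# Hypothetical readings for two devices and two metrics.
conn.executemany(
    "INSERT INTO data_instances VALUES (?, ?, ?, ?)",
    [
        ("dev-a", "cpu", 0, 10.0), ("dev-a", "cpu", 300, 30.0),
        ("dev-b", "cpu", 0, 50.0), ("dev-b", "cpu", 300, 70.0),
        ("dev-a", "temp", 0, 40.0), ("dev-b", "temp", 0, 44.0),
    ],
)

# Devices across the top, metrics down the side. Each CASE yields NULL for
# rows belonging to other devices, and AVG ignores NULLs.
pivot = conn.execute(
    """
    SELECT metric,
           AVG(CASE WHEN device = 'dev-a' THEN metric_value END) AS dev_a,
           AVG(CASE WHEN device = 'dev-b' THEN metric_value END) AS dev_b
    FROM data_instances
    GROUP BY metric
    ORDER BY metric
    """
).fetchall()
print(pivot)   # [('cpu', 20.0, 60.0), ('temp', 40.0, 44.0)]
```

Swapping the roles of device and metric in the CASE expressions and the GROUP BY gives the transposed layout.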

      • Monitor Statistics Data Model.
        (Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)

      It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)

      Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.

      One More Thing

      Last but not least, SQL is an IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.
