Efficiently storing time series data: MySQL or flat files? Many tables (or files) or queries with a WHERE condition?

Question

What's the best way to store time series data from thousands (possibly soon millions) of real-world hardware sensors? The sensors themselves differ: some capture just one variable, some up to a dozen. I need to store these values every hour, and I don't want to delete data that is older than x, i.e. the data will just keep growing.

Currently, I use a MySQL database to store these time series (it also serves a web frontend that shows nice time series graphs for every sensor). I have one table for every sensor, which right now comes to about 11,000 tables in total. Each table has a layout like "timestamp, value1, [value2] ... ".

The database mainly handles selects (every time somebody looks at the graphs) rather than inserts/updates (once an hour). The select query for showing a graph is simply "SELECT * FROM $sensor_id ORDER BY timestamp", so getting the data out is pretty simple and efficient.

However, having that many tables already causes problems when backing up the database, because I run into LOCK limits (e.g. "mysqldump: Got error: 23: Out of resources when opening file './database/table_xyz.MYD' (Errcode: 24) when using LOCK TABLES"). I can get around that error, but obviously it got me thinking...

So the real question breaks down into several sub-questions:

  • How bad is my approach of having one table for every sensor? What if, instead of a few thousand tables, I had a few million (I might have to deal with that many sensors in the near future)?
  • Is storing all sensors' data in one combined table, with an extra column that holds the sensor_id, a better approach, given that it would probably slow down my select statement a lot (SELECT * FROM all_sensors WHERE sensor_id='$sensor_id')? Keep in mind that different sensors measure different things, so this table would have a few dozen columns, instead of the one to a few columns each table has when every sensor gets its own table.
  • I also thought about storing the time series data NOT in MySQL, but in flat (CSV) files instead. The graphing library I use for the frontend (dygraphs) deals fine with CSV files (plus it would give me the option of making these files available for download, which would be a bonus but is not a requirement currently). I would still need the database for other frontend-related things, but it would mean having a few dozen tables instead of 11,000 (or even more if we add more sensors).
  • If I create one file for every table, I would probably run into filesystem limits eventually (this is an ext3 partition, so there's the ~32k files per directory limit). So the same question as above applies here: should I store everything in one large file that holds all sensors' data? That would probably slow down my reads even more, as the graphing library would need to read a much, much bigger file into memory every time someone looks at a graph.

What would you do?

Thanks!

Recommended answer

To answer this question, we must first analyse the real issue you're facing.

The real issue is finding the most efficient combination of writing and retrieving data.

Let's review your conclusions:

  • Thousands of tables - well, that defeats the purpose of a database and makes it harder to work with. You also gain nothing. There is still disk seeking involved, this time with many file descriptors in use. You also have to know the table names, and there are thousands of them. It's also difficult to extract data, which is what databases are for: structuring the data in such a way that you can easily cross-reference the records. Thousands of tables are not efficient from a performance point of view, and not efficient from a usability point of view. Bad choice.

  • A CSV file - it is probably excellent for fetching the data, if you need the entire contents at once. But it's far from good for manipulating or transforming the data. Since you rely on a specific layout, you have to be extra careful when writing to the CSV. If this grows to thousands of CSV files, you haven't done yourself a favor. You've removed all the overhead of SQL (which isn't that big), but you've done nothing for retrieving parts of the data set. You would also have problems fetching historic data or cross-referencing anything. Bad choice.
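To illustrate the point about retrieving parts of the data set, here is a minimal Python sketch. The file contents, the "timestamp,value1" layout (taken from the question), and the read_range helper are all hypothetical. It shows that pulling a one-hour window out of a plain CSV still means parsing every row, because the file has no index to jump to.

```python
import csv
import io
from datetime import datetime

# Hypothetical per-sensor CSV using the "timestamp,value1" layout from the question.
CSV_TEXT = """timestamp,value1
2024-01-01 00:00:00,20.5
2024-01-01 01:00:00,20.7
2024-01-01 02:00:00,21.0
2024-01-01 03:00:00,21.4
"""


def read_range(csv_file, start, end):
    """Return rows with start <= timestamp < end, plus the number of rows examined."""
    matches = []
    examined = 0
    for row in csv.DictReader(csv_file):
        examined += 1  # every row is parsed, whether or not it is wanted
        ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        if start <= ts < end:
            matches.append((ts, float(row["value1"])))
    return matches, examined


rows, examined = read_range(
    io.StringIO(CSV_TEXT),
    datetime(2024, 1, 1, 2),
    datetime(2024, 1, 1, 3),
)
print(rows, examined)
```

Only one row falls inside the window, yet all four rows are examined. With years of hourly data per sensor, that full scan is repeated on every partial read.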

The ideal scenario is being able to access any part of the data set quickly and efficiently, without any kind of structural change.

And this is exactly why we use relational databases, and why we dedicate entire servers with a lot of RAM to them.

In your case, you are using MyISAM tables (hence the .MYD file extension). It's an old storage engine that worked well on the low-end hardware of its day. Today's machines are much faster and have far more memory, which is why we use InnoDB and let it use a lot of RAM to reduce I/O costs. The variable that controls this is innodb_buffer_pool_size - searching for it will turn up useful guidance.

To answer the question - an efficient, satisfactory solution would be to use one table where you store sensor information (id, title, description) and another table where you store sensor readings. You then allocate sufficient RAM or sufficiently fast storage (an SSD). The tables would look like this:

CREATE TABLE sensors ( 
    id int unsigned not null auto_increment,
    sensor_title varchar(255) not null,
    description varchar(255) not null,
    date_created datetime,
    PRIMARY KEY(id)
) ENGINE = InnoDB DEFAULT CHARSET = UTF8;

CREATE TABLE sensor_readings (
    id int unsigned not null auto_increment,
    sensor_id int unsigned not null,
    date_created datetime,
    reading_value varchar(255), -- note: this column's value might vary, I do not know what data type you need to hold value(s)
    PRIMARY KEY(id),
    FOREIGN KEY (sensor_id) REFERENCES sensors (id) ON DELETE CASCADE
) ENGINE = InnoDB DEFAULT CHARSET = UTF8;
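As a rough sketch of how the single readings table could serve the per-sensor graph query, the example below uses Python's built-in sqlite3 module purely as a stand-in for MySQL, so that it runs anywhere. The column types are simplified, the sample rows are made up, and the composite index idx_sensor_time on (sensor_id, date_created) is an assumption rather than part of the schema above. The sensor id is passed as a bound parameter instead of being interpolated into the SQL string.

```python
import sqlite3

# SQLite in memory stands in for MySQL/InnoDB here; types are simplified.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sensors (
    id INTEGER PRIMARY KEY,
    sensor_title TEXT NOT NULL
);
CREATE TABLE sensor_readings (
    id INTEGER PRIMARY KEY,
    sensor_id INTEGER NOT NULL REFERENCES sensors (id),
    date_created TEXT NOT NULL,
    reading_value TEXT
);
-- Assumed addition: a composite index so that one sensor's rows can be
-- located and returned in time order without touching other sensors' rows.
CREATE INDEX idx_sensor_time ON sensor_readings (sensor_id, date_created);
""")

conn.executemany(
    "INSERT INTO sensors (id, sensor_title) VALUES (?, ?)",
    [(1, "temperature"), (2, "humidity")],
)
# Made-up readings, deliberately inserted out of time order.
conn.executemany(
    "INSERT INTO sensor_readings (sensor_id, date_created, reading_value) "
    "VALUES (?, ?, ?)",
    [
        (1, "2024-01-01 01:00:00", "20.7"),
        (2, "2024-01-01 00:00:00", "55"),
        (1, "2024-01-01 00:00:00", "20.5"),
    ],
)


def readings_for(sensor_id):
    """Fetch one sensor's time series, oldest first."""
    return conn.execute(
        "SELECT date_created, reading_value FROM sensor_readings "
        "WHERE sensor_id = ? ORDER BY date_created",
        (sensor_id,),
    ).fetchall()


sensor_1 = readings_for(1)
print(sensor_1)
```

This replaces the question's "SELECT * FROM $sensor_id ORDER BY timestamp" with a single query shape that works for every sensor. Whether the index is worth its write cost depends on the real workload, which here is read-heavy.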

InnoDB has traditionally kept the entire installation in one shared tablespace file by default (since MySQL 5.6.6 the default is innodb_file_per_table, i.e. one file per table, which makes no practical difference with only two tables). Either way, this alleviates the problem of exceeding the file descriptor limit of the OS. Several million, or even tens of millions, of records should not be a problem if you allocate 5-6 GB of RAM to hold the working data set in memory - that would give you quick access to the data.
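For sizing, it may help to estimate how fast the readings table would grow. The sketch below assumes one row per sensor per hour (sensors with several variables would produce more rows if each value were stored separately); the function name and its defaults are illustrative only.

```python
def rows_per_year(sensor_count, readings_per_day=24, days=365):
    """Rows added to the readings table in a year, assuming one row per reading."""
    return sensor_count * readings_per_day * days


current = rows_per_year(11_000)    # roughly the sensor count in the question
future = rows_per_year(1_000_000)  # the "millions of sensors" scenario, lower bound

print(f"{current:,} rows/year at 11,000 sensors")
print(f"{future:,} rows/year at 1,000,000 sensors")
```

At 11,000 sensors this comes to 96,360,000 rows per year, and at one million sensors to 8,760,000,000, so the amount of RAM needed for the working set depends heavily on how much history the graphs actually read.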

If I were designing such a system, this is the first approach I would take (personally). From there, it's easy to adjust depending on what you need to do with the data.
