Django + Postgres + Large Time Series

Question

I am scoping out a project with large, mostly-uncompressible time series data, and wondering if Django + Postgres with raw SQL is the right call.

我有每小时约 2K 个对象/小时的时间序列数据.我每年存储大约 200 万行,我希望 1) 能够通过连接分割数据以进行分析,2) 能够在 Django 提供的网络上进行基本的概述工作.我认为最好的想法是将 Django 用于对象本身,但使用原始 SQL 来处理相关的大型时间序列数据.我认为这是一种混合方法;这可能是一个危险信号,但对一长串数据样本使用完整的 ORM 感觉有点矫枉过正.有没有更好的办法?

I have time series data that is ~2K objects/hour, every hour. That's about 2 million rows per year to store, and I would like to 1) be able to slice off data for analysis through a connection, and 2) be able to do elementary overview work on the web, served by Django. I think the best idea is to use Django for the objects themselves, but drop to raw SQL to deal with the associated large time series data. I see this as a hybrid approach; that might be a red flag, but using the full ORM for a long series of data samples feels like overkill. Is there a better way?
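
For concreteness, what I mean by "drop to raw SQL" is roughly the sketch below (the sensor_data table and its columns are just hypothetical placeholders):

from django.db import connection

def fetch_series(sensor_id, start, end):
    # Pull a slice of raw samples without instantiating model objects.
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT ts, value FROM sensor_data"
            " WHERE sensor_id = %s AND ts BETWEEN %s AND %s"
            " ORDER BY ts",
            [sensor_id, start, end],
        )
        return cursor.fetchall()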

Answer

If I understand your thoughts correctly, you are considering storing the time series in PostgreSQL, one time series record in one database row. Don't do that.

On the one hand, the problem is theoretical. Relational databases (and, I think, most databases) are based on the premise of row independence, whereas the records of a time series are physically ordered. Of course, database indexes provide some order for database tables, but that order is meant to speed up searching or to present results alphabetically or in some other order; it does not imply any natural meaning in that order. Regardless of how you order them, each customer is independent of other customers, and each customer's purchase is independent of his other purchases, even if you can get them all together chronologically in order to form the customer's purchase history. The interdependence of time series records is much stronger, which makes relational databases inappropriate.

In practice, this means that the disk space taken up by the table and its indexes will be huge (maybe 20 times larger than storing the time series in files), and reading time series from the database will be very slow, something like an order of magnitude slower than reading from files. It will also not give you any important benefit. You probably aren't ever going to make the query "give me all time series records whose value is larger than X". If you ever need such a query, you will also need a hell of a lot of other analysis which the relational database has not been designed to perform, so you will read the entire time series into some object anyway.

So each time series should be stored as a file. It might be either a file on the file system, or a blob in the database. Despite the fact that I've implemented the latter, I believe the former is better; in Django, I'd write something like this:

from django.db import models

class Timeseries(models.Model):
    name = models.CharField(max_length=50)
    time_step = models.ForeignKey(...)     # placeholder, fill in as appropriate
    other_metadata = models.Whatever(...)  # stand-in for whatever metadata fields you need
    data = models.FileField(...)           # the series itself is stored in a file

Using a FileField will keep your database smaller and make incremental backups of your system easier. It will also be easier to get slices by seeking in the file, something that would probably be impossible or difficult with a blob.
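
As one illustration of the seeking idea (a sketch only; I'm not prescribing a file format here), a file of fixed-width binary records lets you jump straight to a slice:

import struct

RECORD = struct.Struct("<d")  # hypothetical layout: one little-endian double per sample

def read_slice(path, start_index, count):
    # Seek directly to the first wanted record instead of reading the whole file.
    with open(path, "rb") as f:
        f.seek(start_index * RECORD.size)
        data = f.read(count * RECORD.size)
    return [RECORD.unpack_from(data, i * RECORD.size)[0]
            for i in range(len(data) // RECORD.size)]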

Now, what kind of file? I'd advise you to take a look at pandas. It's a Python library for mathematical analysis that has support for time series, and it should also have a way to store time series in files.
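
For example, a minimal sketch (file name and values made up): pandas can round-trip a series through a file and still support the elementary overview work from the question:

import pandas as pd

# Build a small hourly series and write it to a file, one file per time series.
ts = pd.Series(
    [1.5, 2.0, 0.0],
    index=pd.date_range("2012-01-01", periods=3, freq="h"),
)
ts.to_pickle("timeseries.pkl")
restored = pd.read_pickle("timeseries.pkl")
daily_mean = restored.resample("D").mean()  # e.g. aggregate for a web overview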

I linked above to a library of mine which I don't recommend you use; on the one hand it doesn't do what you want (it can't handle granularity finer than a minute, and it has other shortcomings), and on the other hand it's outdated: I wrote it before pandas existed, and I intend to convert it to use pandas in the future. There's a book, "Python for Data Analysis", by the author of pandas, which I've found invaluable.

Update (2016): There's also InfluxDB. I've never used it, so I have no opinion, but it is definitely something you need to examine if you are wondering how to store time series.

Update (2020-02-07): There's also TimescaleDB, an extension to PostgreSQL.

Update (2020-08-07): We changed our software (again) so that it stores the data in the database, using TimescaleDB. We were already well versed in PostgreSQL, and it was easy to learn some TimescaleDB. The most important concrete advantage is that we can make queries like "find all locations where there was >50 mm of rain within 24 hours in 2019", something that would be very difficult with data stored in flat files. Another advantage is the integrity checks: over the years we had a few time series with duplicate rows because of little bugs here and there. The drawbacks are also significant. It uses 10 times more disk space, and we may need to change our PostgreSQL backup policy because of that. It's also slower: retrieving a time series with 300k records takes maybe one second, where it was instant before, so we needed to implement caching for retrieving time series, which wasn't needed before.
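
To make that example concrete, here is a hedged sketch of how such a query might be run from Django; the rain_measurements schema is invented for illustration, and the sliding 24-hour sum uses a standard PostgreSQL window frame (PostgreSQL 11+) rather than anything TimescaleDB-specific:

from django.db import connection

# Hypothetical hypertable: rain_measurements(location_id, time, value).
QUERY = """
SELECT DISTINCT location_id FROM (
    SELECT location_id,
           sum(value) OVER (
               PARTITION BY location_id ORDER BY time
               RANGE BETWEEN INTERVAL '24 hours' PRECEDING AND CURRENT ROW
           ) AS rain_24h
    FROM rain_measurements
    WHERE time >= '2019-01-01' AND time < '2020-01-01'
) windowed
WHERE rain_24h > 50
"""

def rainy_locations():
    # Return the ids of locations that saw >50 mm of rain in some 24-hour window.
    with connection.cursor() as cursor:
        cursor.execute(QUERY)
        return [row[0] for row in cursor.fetchall()]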
