Django + Postgres + Large Time Series

Problem Description

I am scoping out a project with large, mostly-uncompressible time series data, and wondering if Django + Postgres with raw SQL is the right call.

I have time series data at ~2K objects/hour, every hour. That's about 2 million rows per year that I store, and I would like to 1) be able to slice off data for analysis through a connection, and 2) be able to do elementary overview work on the web, served by Django. I think the best idea is to use Django for the objects themselves, but drop to raw SQL to deal with the associated large time series data. I see this as a hybrid approach; that might be a red flag, but using the full ORM for a long series of data samples feels like overkill. Is there a better way?

Solution

If I understand your thoughts correctly, you are considering storing the time series in PostgreSQL, one time series record in one database row. Don't do that.

On the one hand, the problem is theoretical. Relational databases (and, I think, most databases) are based on the premise of row independence, whereas the records of a time series are physically ordered. Of course, database indexes provide some order for database tables, but that order is meant to speed up searching or to present results alphabetically or in some other order; it does not imply any natural meaning in that order. Regardless of how you order them, each customer is independent of other customers, and each customer's purchase is independent of his other purchases, even if you can retrieve them chronologically to form the customer's purchase history. The interdependence of time series records is much stronger, which makes relational databases inappropriate.

In practice, this means that the disk space taken up by the table and its indexes will be huge (maybe 20 times larger than storing the time series in files), and reading time series from the database will be very slow, something like an order of magnitude slower than reading them from files. It will also not give you any important benefit. You probably aren't ever going to make the query "give me all time series records whose value is larger than X". If you ever need such a query, you will also need a great deal of other analysis that the relational database has not been designed to perform, so you will read the entire time series into some object anyway.

So each time series should be stored as a file. It might be either a file on the file system, or a blob in the database. Despite the fact that I've implemented the latter, I believe the former is better; in Django, I'd write something like this:

from django.db import models

class Timeseries(models.Model):
    name = models.CharField(max_length=50)
    time_step = models.ForeignKey(...)     # the series' time step (details elided)
    other_metadata = models.Whatever(...)  # placeholder for whatever metadata you need
    data = models.FileField(...)           # the series itself, stored as a file
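
For illustration only, here is roughly how a data file could be attached to such a record once the placeholders above are filled in. ContentFile is standard Django API; the name and file contents below are invented for the sketch:

from django.core.files.base import ContentFile

raw_bytes = b""  # the serialized series; empty here just for the sketch

ts = Timeseries(name="station-42")  # "station-42" is an invented example name
ts.data.save("station-42.dat", ContentFile(raw_bytes), save=True)  # stores the file and saves the model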

Using a FileField will keep your database smaller and make it easier to take incremental backups of your system. It will also be easier to get slices by seeking in the file, something that's probably impossible or difficult with a blob.
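
To make the seeking idea concrete, here is a minimal sketch assuming a fixed-width binary layout of 16-byte records (an int64 timestamp plus a float64 value; the layout is an assumption of this sketch, not something prescribed by the answer). With fixed-width records you can jump straight to any slice:

import struct

RECORD = struct.Struct("<qd")  # int64 unix timestamp + float64 value = 16 bytes

def read_slice(path, start_index, count):
    """Read `count` records starting at `start_index` without scanning the whole file."""
    records = []
    with open(path, "rb") as f:
        f.seek(start_index * RECORD.size)  # jump straight to the first wanted record
        for _ in range(count):
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:   # reached end of file
                break
            records.append(RECORD.unpack(chunk))
    return records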

Now, what kind of file? I'd advise you to take a look at pandas. It's a Python library for mathematical analysis that has support for time series, and it should also have a way to store time series in files.
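
As a minimal sketch of that idea, here is one way to round-trip a pandas series through a file, using pickle for simplicity (pandas also offers CSV and HDF5 storage):

import pandas as pd

# Build a small hourly series and round-trip it through a file.
index = pd.date_range("2013-01-01", periods=24, freq="h")
series = pd.Series(range(24), index=index)

series.to_pickle("timeseries.pkl")            # write the series to disk
restored = pd.read_pickle("timeseries.pkl")   # read it back
assert restored.equals(series)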

I linked above to a library of mine which I don't recommend you use; on the one hand it doesn't do what you want (it can't handle granularity finer than a minute, and it has other shortcomings), and on the other it's outdated - I wrote it before pandas, and I intend to convert it to use pandas in the future. There's a book, "Python for Data Analysis", by the author of pandas, which I've found invaluable.

Update: There's also InfluxDB. I've never used it and therefore have no opinion, but it is definitely something you need to examine if you are wondering how to store time series.
