Most efficient way to join two time series


Problem Description

Imagine I have a table like this:

CREATE TABLE time_series (
    snapshot_date DATE,
    sales INTEGER,
    PRIMARY KEY (snapshot_date)
);

With values like this:

INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date,10 AS sales;
INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date,4 AS sales;
INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date,13 AS sales;
INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date,7 AS sales;
INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date,15 AS sales;
INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date,8 AS sales;

I would like to be able to do this:

SELECT a.snapshot_date, 
       AVG(b.sales) AS sales_avg,
       COUNT(*) AS COUNT
  FROM time_series AS a
  JOIN time_series AS b
       ON a.snapshot_date > b.snapshot_date
 GROUP BY a.snapshot_date

Which would produce results like this, where for each snapshot_date, sales_avg is the average of sales on all strictly earlier dates (e.g. 9.0 for 2017-01-04 is (10 + 4 + 13) / 3) and count is the number of earlier rows:

*---------------*-----------*-------*
| snapshot_date | sales_avg | count |
*---------------*-----------*-------*
|  2017-01-02   |   10.0    |    1  |
|  2017-01-03   |   7.0     |    2  |
|  2017-01-04   |   9.0     |    3  |
|  2017-01-05   |   8.5     |    4  |
|  2017-01-06   |   9.8     |    5  |
*---------------*-----------*-------*

With a trivial number of rows, like in this example, the query runs super fast. Problem is I have to do this for millions of rows, and on Redshift (similar in syntax to Postgres) my query takes days to run. It's horribly slow, and yet this is one of my most common query patterns. I suspect that the problem is due to growth of O(n^2) in the data vs the more preferable O(n).
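
One way to make the quadratic blow-up concrete is to count the rows the inequality self-join produces before the GROUP BY. This is just a diagnostic sketch against the sample table above; on the real table you would want to run it on a small date range first.

-- For n input rows the inequality join yields n * (n - 1) / 2 intermediate
-- rows (15 for the 6 sample rows), so a few million input rows become
-- trillions of joined rows before any aggregation happens.
SELECT COUNT(*) AS joined_rows
  FROM time_series AS a
  JOIN time_series AS b
       ON a.snapshot_date > b.snapshot_date;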

My O(n) implementation in python would be something like this:

rows = [('2017-01-01',10),
        ('2017-01-02',4),
        ('2017-01-03',13),
        ('2017-01-04',7),
        ('2017-01-05',15),
        ('2017-01-06',8)]
sales_total_previous = 0  # running total of sales on all earlier days
count = 0                 # number of earlier days seen so far
for index, row in enumerate(rows):
    snapshot_date = row[0]
    sales = row[1]
    if index == 0:
        # the first day has no earlier days to average over, so just accumulate it
        sales_total_previous += sales
        continue
    count += 1
    sales_avg = sales_total_previous / count
    print((snapshot_date, sales_avg, count))
    sales_total_previous += sales

With results like this (same as SQL query):

('2017-01-02', 10.0, 1)
('2017-01-03', 7.0, 2)
('2017-01-04', 9.0, 3)
('2017-01-05', 8.5, 4)
('2017-01-06', 9.8, 5)

I'm considering switching to Apache Spark so that I can do exactly that Python-style computation, but several million rows isn't really that large (it's at most 3-4 GB), and using a Spark cluster with 100 GB of RAM seems like overkill. Is there an efficient and easy-to-read way I can get O(n) efficiency in SQL, preferably in Postgres / Redshift?

Recommended Answer

You seem to want:

SELECT ts.snapshot_date, 
       AVG(ts.sales) OVER (ORDER BY ts.snapshot_date) AS sales_avg,
       ROW_NUMBER() OVER (ORDER BY ts.snapshot_date) AS COUNT
FROM time_series ts;

You'll find that using window functions is much more efficient.
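
Note that the running AVG above includes the current row and ROW_NUMBER() starts at 1 on the first date, so its output differs slightly from the sample result in the question, which averages only the strictly earlier days. Below is a minimal sketch that reproduces that exact output, assuming the ROWS frame clause is available on your Postgres / Redshift version; the prior_count alias is just illustrative, and the DECIMAL cast is there to keep Redshift's AVG from truncating to an integer.

SELECT snapshot_date, sales_avg, prior_count
  FROM (SELECT snapshot_date,
               -- average over all strictly earlier dates; the frame is empty on the first row
               AVG(sales::DECIMAL) OVER (ORDER BY snapshot_date
                                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS sales_avg,
               -- number of earlier rows that contributed to the average
               COUNT(sales) OVER (ORDER BY snapshot_date
                                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prior_count
          FROM time_series) t
 WHERE prior_count > 0  -- drop the first date, which has nothing before it
 ORDER BY snapshot_date;

Either way, the window-function form is one sort plus a single pass over the data instead of an O(n^2) self-join, which is where the speedup comes from.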
