Most efficient way to join two time series


Problem Description

Imagine I have a table like this:

CREATE TABLE time_series (
    snapshot_date DATE,
    sales INTEGER,
    PRIMARY KEY (snapshot_date)
);

With values like this:

INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date,10 AS sales;
INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date,4 AS sales;
INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date,13 AS sales;
INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date,7 AS sales;
INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date,15 AS sales;
INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date,8 AS sales;

I would like to be able to do this:

SELECT a.snapshot_date, 
       AVG(b.sales) AS sales_avg,
       COUNT(*) AS COUNT
  FROM time_series AS a
  JOIN time_series AS b
       ON a.snapshot_date > b.snapshot_date
 GROUP BY a.snapshot_date

Which would produce results like this, where for each snapshot_date, sales_avg is the average of sales on all strictly earlier dates (e.g. 9.0 for 2017-01-04 is (10 + 4 + 13) / 3) and count is the number of earlier rows:

*---------------*-----------*-------*
| snapshot_date | sales_avg | count |
*---------------*-----------*-------*
|  2017-01-02   |   10.0    |    1  |
|  2017-01-03   |   7.0     |    2  |
|  2017-01-04   |   9.0     |    3  |
|  2017-01-05   |   8.5     |    4  |
|  2017-01-06   |   9.8     |    5  |
*---------------*-----------*-------*

With a trivial number of rows, like in this example, the query runs super fast. Problem is I have to do this for millions of rows, and on Redshift (similar in syntax to Postgres) my query takes days to run. It's horribly slow, and yet this is one of my most common query patterns. I suspect that the problem is due to growth of O(n^2) in the data vs the more preferable O(n).
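
One way to make the quadratic blow-up concrete is to count the rows the inequality self-join produces before the GROUP BY. This is just a diagnostic sketch against the sample table above; on the real table you would want to run it on a small date range first.

-- For n input rows the inequality join yields n * (n - 1) / 2 intermediate
-- rows (15 for the 6 sample rows), so a few million input rows become
-- trillions of joined rows before any aggregation happens.
SELECT COUNT(*) AS joined_rows
  FROM time_series AS a
  JOIN time_series AS b
       ON a.snapshot_date > b.snapshot_date;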

My O(n) implementation in python would be something like this:

rows = [('2017-01-01',10),
        ('2017-01-02',4),
        ('2017-01-03',13),
        ('2017-01-04',7),
        ('2017-01-05',15),
        ('2017-01-06',8)]
sales_total_previous = 0  # running total of sales on all earlier days
count = 0                 # number of earlier days seen so far
for index, row in enumerate(rows):
    snapshot_date = row[0]
    sales = row[1]
    if index == 0:
        # the first day has no earlier days to average over, so just accumulate it
        sales_total_previous += sales
        continue
    count += 1
    sales_avg = sales_total_previous / count
    print((snapshot_date, sales_avg, count))
    sales_total_previous += sales

With results like this (same as SQL query):

('2017-01-02', 10.0, 1)
('2017-01-03', 7.0, 2)
('2017-01-04', 9.0, 3)
('2017-01-05', 8.5, 4)
('2017-01-06', 9.8, 5)

I'm considering switching to Apache Spark so that I can do exactly that Python-style computation, but several million rows isn't really that large (it's at most 3-4 GB), and using a Spark cluster with 100 GB of RAM seems like overkill. Is there an efficient and easy-to-read way I can get O(n) efficiency in SQL, preferably in Postgres / Redshift?

Recommended Answer

You seem to want:

SELECT ts.snapshot_date, 
       AVG(ts.sales) OVER (ORDER BY ts.snapshot_date) AS sales_avg,
       ROW_NUMBER() OVER (ORDER BY ts.snapshot_date) AS COUNT
FROM time_series ts;

You'll find that using window functions is much more efficient.
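
Note that the running AVG above includes the current row and ROW_NUMBER() starts at 1 on the first date, so its output differs slightly from the sample result in the question, which averages only the strictly earlier days. Below is a minimal sketch that reproduces that exact output, assuming the ROWS frame clause is available on your Postgres / Redshift version; the prior_count alias is just illustrative, and the DECIMAL cast is there to keep Redshift's AVG from truncating to an integer.

SELECT snapshot_date, sales_avg, prior_count
  FROM (SELECT snapshot_date,
               -- average over all strictly earlier dates; the frame is empty on the first row
               AVG(sales::DECIMAL) OVER (ORDER BY snapshot_date
                                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS sales_avg,
               -- number of earlier rows that contributed to the average
               COUNT(sales) OVER (ORDER BY snapshot_date
                                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prior_count
          FROM time_series) t
 WHERE prior_count > 0  -- drop the first date, which has nothing before it
 ORDER BY snapshot_date;

Either way, the window-function form is one sort plus a single pass over the data instead of an O(n^2) self-join, which is where the speedup comes from.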
