聚合时间序列的高效子查询 [英] Efficient subquery for aggregate time series

查看：34 发布时间：2021/9/7 20:39:21 sql postgresql time-series postgresql-performance

本文介绍了聚合时间序列的高效子查询的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想从某个日期每天构建一个时间序列并计算每天的一些统计数据.但是这个查询很慢......有什么办法可以加快速度吗?(例如，在子查询中选择一次表并计算该表每天的各种统计数据).

I want to build a time series daily from a certain date and calculate a few statistics for each day. However this query is very slow... Any way to speed it up? (for example, select the table once in the subquery and compute various stats on that table for each day).

在代码中这看起来像

for i, day in series:
    previous_days = series[0...i]
    some_calculation_a = some_operation_on(previous_days)
    some_calculation_b = some_other_operation_on(previous_days)

以下是一个时间序列示例，用于查找截至该日期具有 <= 5 条消息的用户:

Here is an example for a time series looking for users with <= 5 messages up to that date:

with
days as
(
select date::Timestamp with time zone from generate_series('2015-07-09', 
  now(), '1 day'::interval) date
),

msgs as 
(
select days.date, 
    (select count(customer_id) from daily_messages where sum < 5 and date_trunc('day'::text, created_at) <= days.date) as LT_5,
    (select count(customer_id) from daily_messages where sum = 1 and date_trunc('day'::text, created_at) <= days.date) as EQ_1
from days, daily_messages
where date_trunc('day'::text, created_at) = days.date
group by days.date
)

select * from msgs;

查询细分:

CTE Scan on msgs  (cost=815579.03..815583.03 rows=200 width=24)
  Output: msgs.date, msgs.lt_5, msgs.eq_1
  CTE days
    ->  Function Scan on pg_catalog.generate_series date  (cost=0.01..10.01 rows=1000 width=8)
          Output: date.date
          Function Call: generate_series('2015-07-09 00:00:00+00'::timestamp with time zone, now(), '1 day'::interval)
  CTE msgs
    ->  Group  (cost=6192.62..815569.02 rows=200 width=8)
          Output: days.date, (SubPlan 2), (SubPlan 3)
          Group Key: days.date
          ->  Merge Join  (cost=6192.62..11239.60 rows=287970 width=8)
                Output: days.date
                Merge Cond: (days.date = (date_trunc('day'::text, daily_messages_2.created_at)))
                ->  Sort  (cost=69.83..72.33 rows=1000 width=8)
                      Output: days.date
                      Sort Key: days.date
                      ->  CTE Scan on days  (cost=0.00..20.00 rows=1000 width=8)
                            Output: days.date
                ->  Sort  (cost=6122.79..6266.78 rows=57594 width=8)
                      Output: daily_messages_2.created_at, (date_trunc('day'::text, daily_messages_2.created_at))
                      Sort Key: (date_trunc('day'::text, daily_messages_2.created_at))
                      ->  Seq Scan on public.daily_messages daily_messages_2  (cost=0.00..1568.94 rows=57594 width=8)
                            Output: daily_messages_2.created_at, date_trunc('day'::text, daily_messages_2.created_at)
          SubPlan 2
            ->  Aggregate  (cost=2016.89..2016.90 rows=1 width=32)
                  Output: count(daily_messages.customer_id)
                  ->  Seq Scan on public.daily_messages  (cost=0.00..2000.89 rows=6399 width=32)
                        Output: daily_messages.created_at, daily_messages.customer_id, daily_messages.day_total, daily_messages.sum, daily_messages.elapsed
                        Filter: ((daily_messages.sum < '5'::numeric) AND (date_trunc('day'::text, daily_messages.created_at) <= days.date))
          SubPlan 3
            ->  Aggregate  (cost=2001.13..2001.14 rows=1 width=32)
                  Output: count(daily_messages_1.customer_id)
                  ->  Seq Scan on public.daily_messages daily_messages_1  (cost=0.00..2000.89 rows=96 width=32)
                        Output: daily_messages_1.created_at, daily_messages_1.customer_id, daily_messages_1.day_total, daily_messages_1.sum, daily_messages_1.elapsed
                        Filter: ((daily_messages_1.sum = '1'::numeric) AND (date_trunc('day'::text, daily_messages_1.created_at) <= days.date))

推荐答案

除了效率很低之外，我怀疑查询也是不正确的.假设当前的 Postgres 9.6，我有根据的猜测:

In addition to being very inefficient, I suspect the query is also incorrect. Assuming current Postgres 9.6, my educated guess:

SELECT created_at::date
     , sum(count(customer_id) FILTER (WHERE sum < 5)) OVER w AS lt_5
     , sum(count(customer_id) FILTER (WHERE sum = 1)) OVER w AS eq_1
FROM   daily_messages m
WHERE  created_at >= timestamptz '2015-07-09'  -- sargable!
AND    created_at <  now()                     -- probably redundant
GROUP  BY 1
WINDOW w AS (ORDER BY created_at::date);

可能不需要所有这些相关的子查询.我用结合了聚合 FILTER 子句的窗口函数替换了它.您可以在聚合函数上使用窗口函数.有更多解释的相关答案:

All those correlated subqueries are probably not needed. I replaced it with window functions combined with aggregate FILTER clauses. You can have a window function over an aggregate function. Related answers with more explanation:

CTE 也无济于事(不必要的开销).您只需要一个子查询——或者甚至不需要，只需要返回集合函数 generate_series() 的结果.generate_series() 可以直接传递timestamptz.不过要注意影响.您的查询取决于会话的时区设置.详情:

The CTEs don't help either (unnecessary overhead). You only would need a single subquery - or not even that, just the result from the set-returning function generate_series(). generate_series() can deliver timestamptz directly. Be aware of implications, though. You query depends on the time zone setting of the session. Details:

在 Rails 和 PostgreSQL 中完全忽略时区

再三考虑，我完全删除了 generate_series().只要您有 INNER JOIN 到 daily_messages，结果中只会保留实际行数.根本不需要 generate_series().LEFT JOIN 有意义.问题中的信息不足.

On second thought, I removed generate_series() completely. As long as you have an INNER JOIN to daily_messages, only days with actual rows remain in the result anyway. No need for generate_series() at all. Would make sense with LEFT JOIN. Not enough information in the question.

解释sargable"的相关答案:

聚合时间序列的高效子查询 [英] Efficient subquery for aggregate time series

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

聚合时间序列的高效子查询 [英] Efficient subquery for aggregate time series

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭