聚合时间序列的高效子查询 [英] Efficient subquery for aggregate time series

查看:34
本文介绍了聚合时间序列的高效子查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从某个日期每天构建一个时间序列并计算每天的一些统计数据.但是这个查询很慢......有什么办法可以加快速度吗?(例如,在子查询中选择一次表并计算该表每天的各种统计数据).

I want to build a time series daily from a certain date and calculate a few statistics for each day. However this query is very slow... Any way to speed it up? (for example, select the table once in the subquery and compute various stats on that table for each day).

在代码中这看起来像

for i, day in series:
    previous_days = series[0...i]
    some_calculation_a = some_operation_on(previous_days)
    some_calculation_b = some_other_operation_on(previous_days)

以下是一个时间序列示例,用于查找截至该日期具有 <= 5 条消息的用户:

Here is an example for a time series looking for users with <= 5 messages up to that date:

with
days as
(
select date::Timestamp with time zone from generate_series('2015-07-09', 
  now(), '1 day'::interval) date
),

msgs as 
(
select days.date, 
    (select count(customer_id) from daily_messages where sum < 5 and date_trunc('day'::text, created_at) <= days.date) as LT_5,
    (select count(customer_id) from daily_messages where sum = 1 and date_trunc('day'::text, created_at) <= days.date) as EQ_1
from days, daily_messages
where date_trunc('day'::text, created_at) = days.date
group by days.date
)

select * from msgs;

查询细分:

CTE Scan on msgs  (cost=815579.03..815583.03 rows=200 width=24)
  Output: msgs.date, msgs.lt_5, msgs.eq_1
  CTE days
    ->  Function Scan on pg_catalog.generate_series date  (cost=0.01..10.01 rows=1000 width=8)
          Output: date.date
          Function Call: generate_series('2015-07-09 00:00:00+00'::timestamp with time zone, now(), '1 day'::interval)
  CTE msgs
    ->  Group  (cost=6192.62..815569.02 rows=200 width=8)
          Output: days.date, (SubPlan 2), (SubPlan 3)
          Group Key: days.date
          ->  Merge Join  (cost=6192.62..11239.60 rows=287970 width=8)
                Output: days.date
                Merge Cond: (days.date = (date_trunc('day'::text, daily_messages_2.created_at)))
                ->  Sort  (cost=69.83..72.33 rows=1000 width=8)
                      Output: days.date
                      Sort Key: days.date
                      ->  CTE Scan on days  (cost=0.00..20.00 rows=1000 width=8)
                            Output: days.date
                ->  Sort  (cost=6122.79..6266.78 rows=57594 width=8)
                      Output: daily_messages_2.created_at, (date_trunc('day'::text, daily_messages_2.created_at))
                      Sort Key: (date_trunc('day'::text, daily_messages_2.created_at))
                      ->  Seq Scan on public.daily_messages daily_messages_2  (cost=0.00..1568.94 rows=57594 width=8)
                            Output: daily_messages_2.created_at, date_trunc('day'::text, daily_messages_2.created_at)
          SubPlan 2
            ->  Aggregate  (cost=2016.89..2016.90 rows=1 width=32)
                  Output: count(daily_messages.customer_id)
                  ->  Seq Scan on public.daily_messages  (cost=0.00..2000.89 rows=6399 width=32)
                        Output: daily_messages.created_at, daily_messages.customer_id, daily_messages.day_total, daily_messages.sum, daily_messages.elapsed
                        Filter: ((daily_messages.sum < '5'::numeric) AND (date_trunc('day'::text, daily_messages.created_at) <= days.date))
          SubPlan 3
            ->  Aggregate  (cost=2001.13..2001.14 rows=1 width=32)
                  Output: count(daily_messages_1.customer_id)
                  ->  Seq Scan on public.daily_messages daily_messages_1  (cost=0.00..2000.89 rows=96 width=32)
                        Output: daily_messages_1.created_at, daily_messages_1.customer_id, daily_messages_1.day_total, daily_messages_1.sum, daily_messages_1.elapsed
                        Filter: ((daily_messages_1.sum = '1'::numeric) AND (date_trunc('day'::text, daily_messages_1.created_at) <= days.date))

推荐答案

除了效率很低之外,我怀疑查询也是不正确的.假设当前的 Postgres 9.6,我有根据的猜测:

In addition to being very inefficient, I suspect the query is also incorrect. Assuming current Postgres 9.6, my educated guess:

SELECT created_at::date
     , sum(count(customer_id) FILTER (WHERE sum < 5)) OVER w AS lt_5
     , sum(count(customer_id) FILTER (WHERE sum = 1)) OVER w AS eq_1
FROM   daily_messages m
WHERE  created_at >= timestamptz '2015-07-09'  -- sargable!
AND    created_at <  now()                     -- probably redundant
GROUP  BY 1
WINDOW w AS (ORDER BY created_at::date);

可能不需要所有这些相关的子查询.我用结合了聚合 FILTER 子句的窗口函数替换了它.您可以在聚合函数上使用窗口函数.有更多解释的相关答案:

All those correlated subqueries are probably not needed. I replaced it with window functions combined with aggregate FILTER clauses. You can have a window function over an aggregate function. Related answers with more explanation:

CTE 也无济于事(不必要的开销).您只需要一个子查询——或者甚至不需要,只需要返回集合函数 generate_series() 的结果.generate_series() 可以直接传递timestamptz.不过要注意影响.您的查询取决于会话的时区设置.详情:

The CTEs don't help either (unnecessary overhead). You only would need a single subquery - or not even that, just the result from the set-returning function generate_series(). generate_series() can deliver timestamptz directly. Be aware of implications, though. You query depends on the time zone setting of the session. Details:

再三考虑,我完全删除了 generate_series().只要您有 INNER JOINdaily_messages,结果中只会保留实际行数.根本不需要 generate_series().LEFT JOIN 有意义.问题中的信息不足.

On second thought, I removed generate_series() completely. As long as you have an INNER JOIN to daily_messages, only days with actual rows remain in the result anyway. No need for generate_series() at all. Would make sense with LEFT JOIN. Not enough information in the question.

解释sargable"的相关答案:

Related answer explaining "sargable":

您可以将 count(customer_id) 替换为 count(*).问题中的信息不足.

You might replace count(customer_id) with count(*). Not enough information in the question.

可能会进一步优化,但没有足够的信息在答案中更具体.

Might be optimized further, but there is not enough information to be more specific in the answer.

SELECT day
     , sum(lt_5_day) OVER w AS lt_5
     , sum(eq_1_day) OVER w AS eq_1
FROM  (
   SELECT day::date
   FROM   generate_series(date '2015-07-09', current_date, interval '1 day') day
   ) d
LEFT   JOIN (
   SELECT created_at::date AS day
        , count(customer_id) FILTER (WHERE sum < 5) AS lt_5_day
        , count(customer_id) FILTER (WHERE sum = 1) AS eq_1_day
   FROM   daily_messages m
   WHERE  created_at >= timestamptz '2015-07-09'
   GROUP  BY 1
   ) m USING (day)
WINDOW w AS (ORDER BY day);

  1. 在子查询 m 中汇总每日总和.
  2. 在子查询 d 中生成时间范围内所有天的系列.
  3. 使用 LEFT [OUTER] JOIN 保留结果中的所有天数,即使当天没有新行.
  1. Aggregate daily sums in subquery m.
  2. Generate series of all days in time range in subquery d.
  3. Use LEFT [OUTER] JOIN to retain all days in the result, even without new rows for the day.

这篇关于聚合时间序列的高效子查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆