Slow LEFT JOIN on CTE with time intervals


Problem description

I am trying to debug a query in PostgreSQL that I've built to bucket market data in time buckets in arbitrary time intervals. Here is my table definition:

CREATE TABLE historical_ohlcv (
  exchange_symbol TEXT                     NOT NULL,
  symbol_id       TEXT                     NOT NULL,
  kafka_key       TEXT                     NOT NULL,
  open            NUMERIC,
  high            NUMERIC,
  low             NUMERIC,
  close           NUMERIC,
  volume          NUMERIC,
  time_open       TIMESTAMP WITH TIME ZONE NOT NULL,
  time_close      TIMESTAMP WITH TIME ZONE,
  CONSTRAINT historical_ohlcv_pkey
  PRIMARY KEY (exchange_symbol, symbol_id, time_open)
);

CREATE INDEX symbol_id_idx
  ON historical_ohlcv (symbol_id);

CREATE INDEX open_close_symbol_id
  ON historical_ohlcv (time_open, time_close, exchange_symbol, symbol_id);

CREATE INDEX time_open_idx
  ON historical_ohlcv (time_open);

CREATE INDEX time_close_idx
  ON historical_ohlcv (time_close);

The table has ~25m rows currently. My query as an example for 1 hour, but could be 5 mins, 10 mins, 2 days, etc.

EXPLAIN ANALYZE WITH vals AS (
    SELECT
      NOW() - '5 months' :: INTERVAL AS frame_start,
      NOW() AS frame_end,
      INTERVAL '1 hour'        AS t_interval
)
  , grid AS (
      SELECT
        start_time,
        lead(start_time, 1)
        OVER (
          ORDER BY start_time ) AS end_time
      FROM (
             SELECT
               generate_series(frame_start, frame_end,
                               t_interval) AS start_time,
               frame_end
             FROM vals
           ) AS x
  )
SELECT max(high)
FROM grid g
  LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                  AND ohlcv.time_close < g.end_time
WHERE exchange_symbol = 'BINANCE'
AND symbol_id = 'ETHBTC'
GROUP BY start_time;

The WHERE clause could be any valid value in the table.

This technique was inspired by:

  • Best way to count records by arbitrary time intervals in Rails+Postgres.

The idea is to make a common table and left join your data with that to indicate which bucket stuff is in. This query is really slow! It's currently taking 15s. Based on the query planner, we have a really expensive nested loop:

QUERY PLAN
HashAggregate  (cost=2758432.05..2758434.05 rows=200 width=40) (actual time=16023.713..16023.817 rows=542 loops=1)
  Group Key: g.start_time
  CTE vals
    ->  Result  (cost=0.00..0.02 rows=1 width=32) (actual time=0.005..0.005 rows=1 loops=1)
  CTE grid
    ->  WindowAgg  (cost=64.86..82.36 rows=1000 width=16) (actual time=2.986..9.594 rows=3625 loops=1)
          ->  Sort  (cost=64.86..67.36 rows=1000 width=8) (actual time=2.981..4.014 rows=3625 loops=1)
                Sort Key: x.start_time
                Sort Method: quicksort  Memory: 266kB
                ->  Subquery Scan on x  (cost=0.00..15.03 rows=1000 width=8) (actual time=0.014..1.991 rows=3625 loops=1)
                      ->  ProjectSet  (cost=0.00..5.03 rows=1000 width=16) (actual time=0.013..1.048 rows=3625 loops=1)
                            ->  CTE Scan on vals  (cost=0.00..0.02 rows=1 width=32) (actual time=0.008..0.009 rows=1 loops=1)
  ->  Nested Loop  (cost=0.56..2694021.34 rows=12865667 width=14) (actual time=7051.730..16015.873 rows=31978 loops=1)
        ->  CTE Scan on grid g  (cost=0.00..20.00 rows=1000 width=16) (actual time=2.988..11.635 rows=3625 loops=1)
        ->  Index Scan using historical_ohlcv_pkey on historical_ohlcv ohlcv  (cost=0.56..2565.34 rows=12866 width=22) (actual time=3.712..4.413 rows=9 loops=3625)
              Index Cond: ((exchange_symbol = 'BINANCE'::text) AND (symbol_id = 'ETHBTC'::text) AND (time_open >= g.start_time))
              Filter: (time_close < g.end_time)
              Rows Removed by Filter: 15502
Planning time: 0.568 ms
Execution time: 16023.979 ms

My guess is this line is doing a lot:

LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                AND ohlcv.time_close < g.end_time

But I'm not sure how to accomplish this in another way.

P.S. apologies if this belongs to dba.SE. I read the FAQ and this seemed too basic for that site, so I posted here.

Edit as requested:

SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1); returns 107.632

For exchange_symbol, there are 3 unique values, for symbol_id there are ~400

PostgreSQL version: PostgreSQL 10.3 (Ubuntu 10.3-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit.

The table will be growing about ~1m records a day, so not exactly read-only. All this stuff is done locally and I will try to move to RDS or to help manage hardware issues.

Related: if I wanted to add other aggregates, specifically 'first in the bucket', 'last in the bucket', min, sum, would my indexing strategy change?

Answer

Correctness first: I suspect a bug in your query:

 LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                 AND ohlcv.time_close < g.end_time

Unlike my referenced answer, you join on a time interval: (time_open, time_close]. The way you do it excludes rows in the table where the interval crosses bucket borders. Only intervals fully contained in a single bucket count. I don't think that's intended?

A simple fix would be to decide bucket membership based on time_open (or time_close) alone. If you want to keep working with both, you have to define exactly how to deal with intervals overlapping with multiple buckets.
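A minimal sketch of that fix, deciding bucket membership by time_open alone so that no row is dropped for straddling a bucket border:

```sql
-- join on time_open only; each row lands in exactly one
-- half-open bucket [start_time, end_time)
LEFT JOIN historical_ohlcv ohlcv ON ohlcv.time_open >= g.start_time
                                AND ohlcv.time_open <  g.end_time
```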

Also, you are looking for max(high) per bucket, which is different in nature from count(*) in my referenced answer.

And your buckets are simple intervals per hour?

Then we can radically simplify. Working with just time_open:

SELECT date_trunc('hour', time_open) AS hour, max(high) AS max_high
FROM   historical_ohlcv
WHERE  exchange_symbol = 'BINANCE'
AND    symbol_id = 'ETHBTC'
AND    time_open >= now() - interval '5 months'  -- frame_start
AND    time_open <  now()                        -- frame_end
GROUP  BY 1
ORDER  BY 1;
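Note that date_trunc() only handles fixed units like hour or day. For the arbitrary widths mentioned in the question (5 min, 10 min, 2 days), one common sketch on Postgres 10 is to snap time_open down with epoch arithmetic; the 600-second width below is an assumed example for 10-minute buckets. (Postgres 14+ offers date_bin() for the same purpose.)

```sql
-- floor each time_open to the nearest multiple of the bucket width (600 s)
SELECT to_timestamp(floor(extract(epoch FROM time_open) / 600) * 600) AS bucket,
       max(high) AS max_high
FROM   historical_ohlcv
WHERE  exchange_symbol = 'BINANCE'
AND    symbol_id = 'ETHBTC'
AND    time_open >= now() - interval '5 months'
AND    time_open <  now()
GROUP  BY 1
ORDER  BY 1;
```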

相关:

  • Resample on time series data

It's hard to talk about further performance optimization while basics are unclear. And we'd need more information.


  • Are WHERE conditions variable?
  • How many distinct values in exchange_symbol and symbol_id?
  • Avg. row size? What do you get for:

SELECT avg(pg_column_size(t)) FROM historical_ohlcv t TABLESAMPLE SYSTEM (0.1);

  • Is the table read-only?

Assuming you always filter on exchange_symbol and symbol_id and values are variable, your table is read-only or autovacuum can keep up with the write load so we can hope for index-only scans, you would best have a multicolumn index on (exchange_symbol, symbol_id, time_open, high DESC) to support this query. Index columns in this order. Related:

  • Multicolumn index and performance
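
As a sketch, that multicolumn index could be created like this (the index name is made up; high DESC only helps if the planner can read max(high) off the index):

```sql
CREATE INDEX historical_ohlcv_bucket_idx
  ON historical_ohlcv (exchange_symbol, symbol_id, time_open, high DESC);
```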

Depending on data distribution and other details a LEFT JOIN LATERAL solution might be another option. Related:

  • How to find an average of values for time intervals in postgres
  • Optimize GROUP BY query to retrieve latest record per user
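
A sketch of such a LEFT JOIN LATERAL variant, assuming hourly buckets and the 5-month frame from the question:

```sql
WITH grid AS (
   SELECT start_time, start_time + interval '1 hour' AS end_time
   FROM   generate_series(now() - interval '5 months', now(),
                          interval '1 hour') start_time
   )
SELECT g.start_time, h.max_high
FROM   grid g
LEFT   JOIN LATERAL (
   -- max() always returns one row, so empty buckets survive as NULL
   SELECT max(high) AS max_high
   FROM   historical_ohlcv
   WHERE  exchange_symbol = 'BINANCE'
   AND    symbol_id = 'ETHBTC'
   AND    time_open >= g.start_time
   AND    time_open <  g.end_time
   ) h ON true
ORDER  BY 1;
```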

Aside from all that, your EXPLAIN plan exhibits some very bad estimates:

  • https://explain.depesz.com/s/E5yI

Are you using a current version of Postgres? You may have to work on your server configuration - or at least set higher statistics targets on relevant columns and more aggressive autovacuum settings for the big table. Related:

  • Keep PostgreSQL from sometimes choosing a bad query plan
  • Aggressive Autovacuum on PostgreSQL
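
For example (the numbers here are illustrative assumptions, not tuned recommendations):

```sql
-- higher statistics targets on the columns used for filtering
ALTER TABLE historical_ohlcv ALTER COLUMN symbol_id SET STATISTICS 1000;
ALTER TABLE historical_ohlcv ALTER COLUMN time_open SET STATISTICS 1000;
ANALYZE historical_ohlcv;

-- more aggressive autovacuum for this one big table
ALTER TABLE historical_ohlcv SET (
   autovacuum_vacuum_scale_factor  = 0.02,
   autovacuum_analyze_scale_factor = 0.01
);
```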
