SELECT DISTINCT is slower than expected on my table in PostgreSQL

Question

This is my table schema:

CREATE TABLE tickers (
    product_id TEXT NOT NULL,
    trade_id INT NOT NULL,
    sequence BIGINT NOT NULL,
    time TIMESTAMPTZ,
    price NUMERIC NOT NULL,
    side TEXT NOT NULL,
    last_size NUMERIC NOT NULL,
    best_bid NUMERIC NOT NULL,
    best_ask NUMERIC NOT NULL,
    PRIMARY KEY (product_id, trade_id)
);

My application subscribes to Coinbase Pro's websocket on the "ticker" channel and inserts a row into the tickers table whenever it receives a message.

The table has nearly two million rows now.

I assumed that running SELECT DISTINCT product_id FROM tickers would be fast, but it takes around 500 to 600 milliseconds. Here's the output from EXPLAIN ANALYZE:

HashAggregate  (cost=47938.97..47939.38 rows=40 width=8) (actual time=583.105..583.110 rows=40 loops=1)
  Group Key: product_id
  ->  Seq Scan on tickers  (cost=0.00..42990.98 rows=1979198 width=8) (actual time=0.030..195.536 rows=1979243 loops=1)
Planning Time: 0.068 ms
Execution Time: 583.137 ms

If I turn off seq scanning by running SET enable_seqscan = FALSE (not something I want to actually rely on, just doing it for testing purposes) then the query is a little faster. Between 400 and 500 milliseconds. Here's the output from EXPLAIN ANALYZE:

Unique  (cost=0.43..80722.61 rows=40 width=8) (actual time=0.020..480.339 rows=40 loops=1)
  ->  Index Only Scan using tickers_pkey on tickers  (cost=0.43..75772.49 rows=1980051 width=8) (actual time=0.019..344.113 rows=1980160 loops=1)
        Heap Fetches: 328693
Planning Time: 0.064 ms
Execution Time: 480.386 ms

There are only 40 unique product IDs in the table. I assumed that since product_id is part of the composite primary key, and thus indexed, SELECT DISTINCT product_id FROM tickers would be much faster. But as it turns out, the query planner defaults to using a seq scan rather than the index, and even if I force it to use the index it's still slow (but a little faster than seq scan). I realize I could create another table to store nothing but unique product IDs and query that instead, but I'm more concerned with the reason(s) why my query on the tickers table is taking so long.

EDIT #1: I tried creating an index solely on the product_id column (CREATE INDEX idx_tickers_product_id ON tickers (product_id)) and the query planner still does a sequential scan unless I run SET enable_seqscan = FALSE first. But its performance is slightly better (10 to 50 milliseconds faster) than when the composite PK index is used.

EDIT #2: I tried Erwin Brandstetter's solution and it greatly improved the speed. There are now 2.25 million rows in the table and the execution only takes 0.75 milliseconds!

EDIT #3: I wanted to augment the accepted solution in order to retrieve the ticker count (max(trade_id) - min(trade_id) + 1) as well as the min and max time for each product id. I created a new question for this: How to use index skip emulation in PostgreSQL to retrieve distinct product IDs and also min/max for certain columns

Recommended answer

While there is no index skip scan in Postgres yet, emulate it:

WITH RECURSIVE cte AS (
   (   -- parentheses required
   SELECT product_id
   FROM   tickers
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT product_id
      FROM   tickers t
      WHERE  t.product_id > c.product_id  -- lateral reference
      ORDER  BY 1
      LIMIT  1
      ) l
   )
TABLE  cte;

With an index on (product_id) and only 40 unique product IDs in the table this should be Fast. With capital F.
The PK index on (product_id, trade_id) is good for it, too!
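
An equivalent formulation, sketched here as an alternative and not taken from the answer itself, replaces the LATERAL join with a correlated min() subquery. PostgreSQL can usually satisfy min() on the leading column of an index with a single index probe, so each recursion step should still touch only one index entry. The IS NOT NULL checks are needed because the subquery yields NULL once no larger product_id remains.

```sql
-- Sketch of the same skip-scan emulation using min() instead of LATERAL.
-- Assumes the tickers table from the question and an index whose leading
-- column is product_id (the PK on (product_id, trade_id) qualifies).
WITH RECURSIVE cte AS (
   SELECT min(product_id) AS product_id      -- smallest product_id
   FROM   tickers
   UNION ALL
   SELECT (SELECT min(t.product_id)          -- next larger product_id, or NULL
           FROM   tickers t
           WHERE  t.product_id > c.product_id)
   FROM   cte c
   WHERE  c.product_id IS NOT NULL           -- stop after the last value
   )
SELECT product_id
FROM   cte
WHERE  product_id IS NOT NULL;               -- drop the terminating NULL row
```

Prefixing either variant with EXPLAIN (ANALYZE) is a quick way to check that the plan uses index scans on tickers rather than a Seq Scan.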

With only very few rows per product_id (the opposite of your data distribution), DISTINCT / DISTINCT ON would be as fast or faster.
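
For reference, the two plain forms mentioned above look like this on the tickers table. The second query is only an illustration: treating the highest trade_id as the "latest" row per product is an assumption, not something stated in the question.

```sql
-- Plain DISTINCT: reads every row (or every index entry) once.
SELECT DISTINCT product_id
FROM   tickers;

-- DISTINCT ON: one full row per product_id.
-- ORDER BY must start with the DISTINCT ON expression; the remaining sort
-- key (trade_id DESC, an assumed choice) decides which row is kept.
SELECT DISTINCT ON (product_id)
       product_id, trade_id, price
FROM   tickers
ORDER  BY product_id, trade_id DESC;
```

Both forms process all rows for each product_id, which is why they only compete with the recursive query when there are few rows per distinct value.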

Work to implement index skip scans is ongoing.
See:
