Very slow distinct and sort method in Postgres

Views: 89 | Published: 2020/5/30 1:52:50 | Tags: sql, postgresql

Question

I have the following view: http://pastebin.com/jgLeM3cd. My database is about 10 GB. The problem is that, because of the DISTINCT, the view runs really, really slowly.

SELECT DISTINCT 
    users.id AS user_id, 
    contacts.id AS contact_id,
    contact_types.name AS relationship, 
    channels.name AS channel,
    feed_items.send_at AS sent_at, 
    feed_items.body AS message,
    feed_items.from_id, 
    feed_items.feed_id
FROM feed_items
JOIN channels ON feed_items.channel_id = channels.id
JOIN feeds ON feed_items.feed_id = feeds.id
JOIN contacts ON feeds.contact_id = contacts.id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id
WHERE contacts.is_fake = false;

For example, here is the EXPLAIN ANALYZE output with LIMIT 10: https://explain.depesz.com/s/K8q2

   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7717200.06..7717200.28 rows=10 width=1113) (actual time=118656.704..118656.726 rows=10 loops=1)
   ->  Unique  (cost=7717200.06..7780174.02 rows=2798843 width=1113) (actual time=118656.702..118656.723 rows=10 loops=1)
         ->  Sort  (cost=7717200.06..7724197.16 rows=2798843 width=1113) (actual time=118656.700..118656.712 rows=10 loops=1)
               Sort Key: users.id, contacts.id, contact_types.name, channels.name, feed_items.send_at, feed_items.body, feed_items.from_id, feed_items.feed_id
               Sort Method: external merge  Disk: 589888kB
               ->  Hash Join  (cost=22677.02..577531.86 rows=2798843 width=1113) (actual time=416.072..12918.259 rows=5301453 loops=1)
                     Hash Cond: (feed_items.channel_id = channels.id)
                     ->  Hash Join  (cost=22675.84..539046.59 rows=2798843 width=601) (actual time=416.052..10703.796 rows=5301636 loops=1)
                           Hash Cond: (contacts.contact_type_id = contact_types.id)
                           ->  Hash Join  (cost=22674.73..500479.61 rows=2820650 width=89) (actual time=416.038..8494.439 rows=5303074 loops=1)
                                 Hash Cond: (feed_items.feed_id = feeds.id)
                                 ->  Seq Scan on feed_items  (cost=0.00..223787.54 rows=6828254 width=77) (actual time=0.025..2300.762 rows=6820169 loops=1)
                                 ->  Hash  (cost=18314.88..18314.88 rows=250788 width=16) (actual time=415.830..415.830 rows=268669 loops=1)
                                       Buckets: 4096  Batches: 16  Memory Usage: 806kB
                                       ->  Hash Join  (cost=1642.22..18314.88 rows=250788 width=16) (actual time=19.562..337.146 rows=268669 loops=1)
                                             Hash Cond: (feeds.contact_id = contacts.id)
                                             ->  Seq Scan on feeds  (cost=0.00..11888.11 rows=607111 width=8) (actual time=0.013..116.339 rows=607117 loops=1)
                                             ->  Hash  (cost=1517.99..1517.99 rows=9938 width=12) (actual time=19.537..19.537 rows=9945 loops=1)
                                                   Buckets: 1024  Batches: 1  Memory Usage: 427kB
                                                   ->  Hash Join  (cost=619.65..1517.99 rows=9938 width=12) (actual time=5.743..16.746 rows=9945 loops=1)
                                                         Hash Cond: (contacts.user_id = users.id)
                                                         ->  Seq Scan on contacts  (cost=0.00..699.58 rows=9938 width=12) (actual time=0.005..5.981 rows=9945 loops=1)
                                                               Filter: (NOT is_fake)
                                                               Rows Removed by Filter: 14120
                                                         ->  Hash  (cost=473.18..473.18 rows=11718 width=4) (actual time=5.728..5.728 rows=11800 loops=1)
                                                               Buckets: 2048  Batches: 1  Memory Usage: 415kB
                                                               ->  Seq Scan on users  (cost=0.00..473.18 rows=11718 width=4) (actual time=0.004..2.915 rows=11800 loops=1)
                           ->  Hash  (cost=1.05..1.05 rows=5 width=520) (actual time=0.004..0.004 rows=5 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                 ->  Seq Scan on contact_types  (cost=0.00..1.05 rows=5 width=520) (actual time=0.002..0.003 rows=5 loops=1)
                     ->  Hash  (cost=1.08..1.08 rows=8 width=520) (actual time=0.012..0.012 rows=8 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 1kB
                           ->  Seq Scan on channels  (cost=0.00..1.08 rows=8 width=520) (actual time=0.006..0.007 rows=8 loops=1)
 Total runtime: 118765.513 ms
(34 rows)

I've created b-tree indexes on almost all of the columns used, except feed_items.body, because it is a text column. I also increased work_mem, but it didn't help. Any ideas how I can speed this up?

Recommended answer

As others said in the comments:

Use DISTINCT on as few fields as possible.

Maybe you only need a GROUP BY...

Increasing work_mem could help, but it is not a definitive solution (the query is very inefficient and, as the database grows, performance will degrade again...).
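One way to read the first two suggestions, as an untested sketch: deduplicate only the narrow feed_items columns, then join the result to the lookup tables. The plan in the question sorts rows of width 1113, while the feed_items scan alone has width 77. This is only equivalent to the original query under assumptions that the question does not confirm: the id columns being joined on are primary keys, and channels.name and contact_types.name are unique.

```sql
-- Sketch only, not tested. Assumes feeds.id, contacts.id, users.id,
-- channels.id and contact_types.id are primary keys, and that
-- channels.name and contact_types.name are unique. If not, the result
-- can differ from the original SELECT DISTINCT.
SELECT
    users.id           AS user_id,
    contacts.id        AS contact_id,
    contact_types.name AS relationship,
    channels.name      AS channel,
    fi.send_at         AS sent_at,
    fi.body            AS message,
    fi.from_id,
    fi.feed_id
FROM (
    -- Deduplicate before joining, so the sort handles narrower rows.
    SELECT DISTINCT feed_id, channel_id, from_id, send_at, body
    FROM feed_items
) AS fi
JOIN channels      ON fi.channel_id = channels.id
JOIN feeds         ON fi.feed_id = feeds.id
JOIN contacts      ON feeds.contact_id = contacts.id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users         ON contacts.user_id = users.id
WHERE contacts.is_fake = false;
```

The subquery deduplicates all of feed_items rather than only the rows that survive the is_fake filter, so whether this wins depends on the data. Compare both forms with EXPLAIN ANALYZE.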

Also:

Indexes can hardly help with large scan queries like this one. An index finds specific rows faster, but a full scan through an index is usually much more expensive than a sequential scan over the table (or join).

The only exception is when you need just a few records from a big table. The planner will hardly guess that on its own, though, so you would need to force it with a subquery or a CTE (a WITH clause).
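A minimal sketch of the CTE approach, using the is_fake filter from the question as the "pick a few rows first" step. The CTE name real_contacts is hypothetical. Before PostgreSQL 12 a CTE is always evaluated on its own; from version 12 on, the planner may inline it unless it is declared AS MATERIALIZED, so which form applies depends on the server version.

```sql
-- Sketch only, not tested. "real_contacts" is an illustrative name.
-- On PostgreSQL 12 or later, write "AS MATERIALIZED (" instead of "AS ("
-- to keep the CTE from being inlined into the outer query.
WITH real_contacts AS (
    SELECT id, user_id, contact_type_id
    FROM contacts
    WHERE is_fake = false
)
SELECT
    real_contacts.user_id,
    real_contacts.id AS contact_id,
    feed_items.send_at AS sent_at,
    feed_items.body AS message,
    feed_items.from_id,
    feed_items.feed_id
FROM real_contacts
JOIN feeds ON feeds.contact_id = real_contacts.id
JOIN feed_items ON feed_items.feed_id = feeds.id;
```

In the plan from the question the filter keeps 9945 contacts and removes 14120, yet about 5.3 million of the 6.8 million feed_items rows still survive the joins, so this filter may not be selective enough here to change much.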

Along the same lines as increasing work_mem: PostgreSQL 9.6 comes with parallel scan capabilities (they must be enabled by hand first). If your server runs that version, or you have the chance to upgrade, that could also improve the response time (although, either way, your query still seems to need improving ;-)).
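A sketch of both settings applied to a single transaction, so nothing changes server-wide. The values and the view name my_view are assumptions. The plan in the question spilled about 576 MB to disk (Disk: 589888kB), and an in-memory sort typically needs more memory than the on-disk figure suggests. On 9.6, max_parallel_workers_per_gather defaults to 0, which is why parallelism has to be enabled by hand there.

```sql
-- Sketch only: values are guesses to tune for your server, and
-- "my_view" stands in for the actual view name.
BEGIN;

-- Applies to each sort or hash operation in this transaction only.
SET LOCAL work_mem = '1GB';

-- PostgreSQL 9.6 or later; 0 disables parallel query.
SET LOCAL max_parallel_workers_per_gather = 4;

-- Check whether "Sort Method" still reports an external merge on disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM my_view LIMIT 10;

COMMIT;
```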

So my recommendation is to reduce the data involved in the joins as much as possible, especially in the first joins. That is: the join order does matter. Remember that (fortunately) you don't have any left joins, so each join is actually a potential filter. Starting with the smaller tables (or the tables from which you will pick fewer rows) can considerably reduce the memory needed for the join.

For example (based on your query, not tested at all, and remember that your data distribution matters):

SELECT DISTINCT
    users.id AS user_id,
    contacts.id AS contact_id,
    contact_types.name AS relationship,
    channels.name AS channel,
    feed_items.send_at AS sent_at,
    feed_items.body AS message,
    feed_items.from_id,
    feed_items.feed_id
-- Base the query on contacts, because it is the only table where rows
-- are actually discarded:
FROM contacts
JOIN feeds ON (
    contacts.is_fake = false -- Filter here to reduce join size
    AND feeds.contact_id = contacts.id -- Actual join condition
)
JOIN feed_items ON feed_items.feed_id = feeds.id
JOIN channels ON channels.id = feed_items.channel_id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id
;

But, again: it all depends on your actual data.
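One caveat when testing a reordered query like the one above: for plain inner joins, PostgreSQL's planner is generally free to pick its own join order as long as the number of joined tables stays below join_collapse_limit (default 8), so the order as written may not be the order that runs. A sketch for making the planner follow the written order during an experiment:

```sql
-- Sketch only: force the planner to join in the order written, for this
-- transaction, so the effect of reordering can be measured.
BEGIN;
SET LOCAL join_collapse_limit = 1;

EXPLAIN ANALYZE
SELECT count(*)
FROM contacts
JOIN feeds ON (
    contacts.is_fake = false
    AND feeds.contact_id = contacts.id
)
JOIN feed_items ON feed_items.feed_id = feeds.id
JOIN channels ON channels.id = feed_items.channel_id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id;

COMMIT;
```

If the plan looks the same with and without the setting, the planner was already choosing that order on its own.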

Try it, run EXPLAIN ANALYZE on it, identify the most expensive parts, and think about strategies to improve them.
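As a sketch of how to isolate the expensive part: in the plan from the question, the topmost Hash Join finishes at about 12.9 s, while the Sort node above it only returns its first row at about 118.7 s, so most of the time goes into sorting for DISTINCT rather than into the joins. Running the joins alone, without DISTINCT, can confirm that split on your own server:

```sql
-- Sketch only: time the joins without DISTINCT to separate join cost
-- from sort/deduplication cost. count(*) avoids shipping rows to the
-- client, so the timing reflects only the work done on the server.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM feed_items
JOIN channels ON feed_items.channel_id = channels.id
JOIN feeds ON feed_items.feed_id = feeds.id
JOIN contacts ON feeds.contact_id = contacts.id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id
WHERE contacts.is_fake = false;
```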

These are only a few random ideas, but I hope they help you a bit.

Good luck!
