Very slow distinct and sort method in Postgres


Problem description

I have the following view: http://pastebin.com/jgLeM3cd, and my database is about 10 GB in size. The problem is that, because of the DISTINCT, the view executes really, really slowly.

SELECT DISTINCT 
    users.id AS user_id, 
    contacts.id AS contact_id,
    contact_types.name AS relationship, 
    channels.name AS channel,
    feed_items.send_at AS sent_at, 
    feed_items.body AS message,
    feed_items.from_id, 
    feed_items.feed_id
FROM feed_items
JOIN channels ON feed_items.channel_id = channels.id
JOIN feeds ON feed_items.feed_id = feeds.id
JOIN contacts ON feeds.contact_id = contacts.id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id
WHERE contacts.is_fake = false;

For example, here is the query plan for the execution with LIMIT 10: https://explain.depesz.com/s/K8q2

   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7717200.06..7717200.28 rows=10 width=1113) (actual time=118656.704..118656.726 rows=10 loops=1)
   ->  Unique  (cost=7717200.06..7780174.02 rows=2798843 width=1113) (actual time=118656.702..118656.723 rows=10 loops=1)
         ->  Sort  (cost=7717200.06..7724197.16 rows=2798843 width=1113) (actual time=118656.700..118656.712 rows=10 loops=1)
               Sort Key: users.id, contacts.id, contact_types.name, channels.name, feed_items.send_at, feed_items.body, feed_items.from_id, feed_items.feed_id
               Sort Method: external merge  Disk: 589888kB
               ->  Hash Join  (cost=22677.02..577531.86 rows=2798843 width=1113) (actual time=416.072..12918.259 rows=5301453 loops=1)
                     Hash Cond: (feed_items.channel_id = channels.id)
                     ->  Hash Join  (cost=22675.84..539046.59 rows=2798843 width=601) (actual time=416.052..10703.796 rows=5301636 loops=1)
                           Hash Cond: (contacts.contact_type_id = contact_types.id)
                           ->  Hash Join  (cost=22674.73..500479.61 rows=2820650 width=89) (actual time=416.038..8494.439 rows=5303074 loops=1)
                                 Hash Cond: (feed_items.feed_id = feeds.id)
                                 ->  Seq Scan on feed_items  (cost=0.00..223787.54 rows=6828254 width=77) (actual time=0.025..2300.762 rows=6820169 loops=1)
                                 ->  Hash  (cost=18314.88..18314.88 rows=250788 width=16) (actual time=415.830..415.830 rows=268669 loops=1)
                                       Buckets: 4096  Batches: 16  Memory Usage: 806kB
                                       ->  Hash Join  (cost=1642.22..18314.88 rows=250788 width=16) (actual time=19.562..337.146 rows=268669 loops=1)
                                             Hash Cond: (feeds.contact_id = contacts.id)
                                             ->  Seq Scan on feeds  (cost=0.00..11888.11 rows=607111 width=8) (actual time=0.013..116.339 rows=607117 loops=1)
                                             ->  Hash  (cost=1517.99..1517.99 rows=9938 width=12) (actual time=19.537..19.537 rows=9945 loops=1)
                                                   Buckets: 1024  Batches: 1  Memory Usage: 427kB
                                                   ->  Hash Join  (cost=619.65..1517.99 rows=9938 width=12) (actual time=5.743..16.746 rows=9945 loops=1)
                                                         Hash Cond: (contacts.user_id = users.id)
                                                         ->  Seq Scan on contacts  (cost=0.00..699.58 rows=9938 width=12) (actual time=0.005..5.981 rows=9945 loops=1)
                                                               Filter: (NOT is_fake)
                                                               Rows Removed by Filter: 14120
                                                         ->  Hash  (cost=473.18..473.18 rows=11718 width=4) (actual time=5.728..5.728 rows=11800 loops=1)
                                                               Buckets: 2048  Batches: 1  Memory Usage: 415kB
                                                               ->  Seq Scan on users  (cost=0.00..473.18 rows=11718 width=4) (actual time=0.004..2.915 rows=11800 loops=1)
                           ->  Hash  (cost=1.05..1.05 rows=5 width=520) (actual time=0.004..0.004 rows=5 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                 ->  Seq Scan on contact_types  (cost=0.00..1.05 rows=5 width=520) (actual time=0.002..0.003 rows=5 loops=1)
                     ->  Hash  (cost=1.08..1.08 rows=8 width=520) (actual time=0.012..0.012 rows=8 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 1kB
                           ->  Seq Scan on channels  (cost=0.00..1.08 rows=8 width=520) (actual time=0.006..0.007 rows=8 loops=1)
 Total runtime: 118765.513 ms
(34 rows)

I've created b-tree indexes on almost all of the columns that are used, except feed_items.body, because it is a text column. I've also increased work_mem, but it didn't help. Any ideas how I can speed it up?

Recommended answer

As others said in the comments:


  • Use DISTINCT with as few fields as possible.

  • Maybe you only need a GROUP BY... (a minimal sketch follows this list).
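
For illustration, here is a minimal sketch of the GROUP BY idea. It is purely hypothetical: it assumes you actually want one row per user/contact pair with, say, the latest message time, rather than every distinct feed item; the trimmed column list is mine, not part of the original view.

-- Hypothetical rewrite: one row per (user, contact) with an aggregate,
-- instead of DISTINCT over every wide feed_items column.
SELECT
    users.id                AS user_id,
    contacts.id             AS contact_id,
    max(feed_items.send_at) AS last_sent_at
FROM feed_items
JOIN feeds    ON feed_items.feed_id = feeds.id
JOIN contacts ON feeds.contact_id   = contacts.id
JOIN users    ON contacts.user_id   = users.id
WHERE contacts.is_fake = false
GROUP BY users.id, contacts.id;

Grouping on the two narrow id columns lets the sort/hash work on a few bytes per row instead of the full message body, which is what made the original sort spill hundreds of megabytes to disk.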

Increasing work_mem could help, but it is not a definitive solution (the query is very inefficient and, as the database grows, it will degrade again...).
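
For scale: the plan above shows the sort spilling to disk ("external merge  Disk: 589888kB"), so an in-memory sort would need work_mem somewhere above that figure. A hedged sketch of setting it per session (the value is illustrative only; work_mem is allocated per sort/hash node per connection, so large values are risky as a global setting):

-- Illustrative only: raise work_mem for this session so the sort can stay in
-- memory; the 700MB figure is a guess based on the ~576 MB the plan spilled.
SET work_mem = '700MB';
-- ... run the query / EXPLAIN ANALYZE here ...
RESET work_mem;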

Also:


  • Indexes hardly help in large scan queries like this one: an index can pick out specific rows faster, but a full scan over an index is much more expensive than a sequential scan over the table (or join).

The only exception to that is when you only need to pick a few records from a big table. But the planner will hardly guess that, so you need to force it by using a subquery or a CTE (WITH clause).
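
A hedged sketch of that exception, assuming (hypothetically) that you only ever need the feed for a single user: the CTE materializes the small set of contacts first, so the big feed_items join only touches rows belonging to them (in the PostgreSQL versions current at the time of writing, a CTE is always materialized and acts as an optimization fence). The user_id = 42 filter and the trimmed column list are mine, not part of the original view.

WITH my_contacts AS (
    SELECT contacts.id, contacts.user_id
    FROM contacts
    WHERE contacts.is_fake = false
      AND contacts.user_id = 42          -- hypothetical narrowing filter
)
SELECT DISTINCT
    my_contacts.user_id,
    my_contacts.id      AS contact_id,
    feed_items.send_at  AS sent_at,
    feed_items.body     AS message
FROM my_contacts
JOIN feeds      ON feeds.contact_id   = my_contacts.id
JOIN feed_items ON feed_items.feed_id = feeds.id;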

Along the same lines as increasing work_mem, PostgreSQL 9.6 comes with parallel scan capabilities (they must be enabled by hand first): if your server is on that version, or you have the chance to upgrade it, this could also speed up the response time (even so, your query seems to need improving anyway... ;-)).
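
If you do end up on 9.6, a hedged sketch of enabling it per session (parallel query ships disabled there, with max_parallel_workers_per_gather = 0; pick a worker count that matches your CPUs):

-- 9.6 setting; 0 (the default) disables parallel query entirely.
SET max_parallel_workers_per_gather = 4;
-- then EXPLAIN ANALYZE the query again and look for Gather / Parallel Seq Scan nodes.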

So my recommendation is to try to reduce as much as possible the data involved in the joins, and especially in the first joins. That is: the join order does matter. Remember that (fortunately) you don't have any left joins, so each join is effectively a potential filter; joining the smaller tables first (or the tables from which you will pick fewer rows) can considerably reduce the memory needed for the join.

For example (based on your query, not tested at all, and REMEMBER: your data distribution matters):

SELECT DISTINCT
    users.id AS user_id,
    contacts.id AS contact_id,
    contact_types.name AS relationship,
    channels.name AS channel,
    feed_items.send_at AS sent_at,
    feed_items.body AS message,
    feed_items.from_id,
    feed_items.feed_id
-- Base the query on contacts, because it is the only place where
-- rows are being discarded:
FROM contacts
JOIN feeds ON (
    contacts.is_fake = false -- Filter here to reduce join size
    and feeds.contact_id = contacts.id -- Actual join condition
)
JOIN feed_items ON feed_items.feed_id = feeds.id
JOIN channels ON channels.id = feed_items.channel_id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id
;

But, again: it all depends on your actual data.

Try it, EXPLAIN ANALYZE it, identify the most expensive parts, and think about strategies to improve them.
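
For reference, a minimal sketch of that step, shown here against a deliberately tiny query; in practice you would wrap the full rewritten query above. The BUFFERS option is optional but adds per-node I/O counters, which helps locate the expensive parts.

-- ANALYZE actually executes the query; BUFFERS reports shared hit/read per node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT contacts.id, contacts.user_id
FROM contacts
WHERE contacts.is_fake = false;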

These were only a few random ideas, but I hope they help you a bit.

Good luck!
