使用嵌套循环提高SQL查询的性能-PostgreSQL [英] Improve performance on SQL query with Nested Loop - PostgreSQL

查看:157
本文介绍了使用嵌套循环提高SQL查询的性能-PostgreSQL的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 PostgreSQL ,并且我的SQL查询有一个奇怪的问题。 根据我使用的最晚日期参数。我的请求没有执行相同的操作。

I am using PostgreSQL and I have a weird problem with my SQL query. Depending on wich date paramter I'm using. My request doesn't do the same operation.

这是我的工作查询:

SELECT DISTINCT app.id_application 
FROM stat sj
LEFT OUTER JOIN groupe gp ON gp.id_groupe = sj.id_groupe 
LEFT OUTER JOIN application app ON app.id_application = gp.id_application 
WHERE date_stat >= '2016/3/01' 
AND date_stat <= '2016/3/31' 
AND ( date_stat = date_gen-1 or (date_gen = '2016/04/01' AND date_stat = '2016/3/31')) 
AND app.id_application IS NOT NULL 

此查询大约需要2秒的时间(对我来说这是正常的,因为我有很多行)。当我对此查询运行EXPLAIN ANALYZE时,我得到的是:

This query takes around 2 secondes (which is OKAY for me because I have a lots of rows). When I run EXPLAIN ANALYSE for this query I have this:

HashAggregate  (cost=375486.95..375493.62 rows=667 width=4) (actual time=2320.541..2320.656 rows=442 loops=1)
    ->  Hash Join  (cost=254.02..375478.99 rows=3186 width=4) (actual time=6.144..2271.984 rows=263274 loops=1)
    Hash Cond: (gp.id_application = app.id_application)
    ->  Hash Join  (cost=234.01..375415.17 rows=3186 width=4) (actual time=5.926..2200.671 rows=263274 loops=1)
          Hash Cond: (sj.id_groupe = gp.id_groupe)
          ->  Seq Scan on stat sj  (cost=0.00..375109.47 rows=3186 width=8) (actual time=3.196..2068.357 rows=263274 loops=1)
                Filter: ((date_stat >= '2016-03-01'::date) AND (date_stat <= '2016-03-31'::date) AND ((date_stat = (date_gen - 1)) OR ((date_gen = '2016-04-01'::date) AND (date_stat = '2016-03-31'::date))))
                Rows Removed by Filter: 7199514
          ->  Hash  (cost=133.45..133.45 rows=8045 width=12) (actual time=2.677..2.677 rows=8019 loops=1)
                Buckets: 1024  Batches: 1  Memory Usage: 345kB
                ->  Seq Scan on groupe gp  (cost=0.00..133.45 rows=8045 width=12) (actual time=0.007..1.284 rows=8019 loops=1)
    ->  Hash  (cost=11.67..11.67 rows=667 width=4) (actual time=0.206..0.206 rows=692 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 25kB
          ->  Seq Scan on application app  (cost=0.00..11.67 rows=667 width=4) (actual time=0.007..0.101 rows=692 loops=1)
                Filter: (id_application IS NOT NULL)
    Total runtime: 2320.855 ms

现在,当我尝试对本月进行相同的查询时(我们是4月6日,所以我试图使用相同的查询获取所有4月的application_id

Now, When I'm trying the same query for the current month (we are the 6th of April, so I'm trying to get all the application_id of April) with the same query

SELECT DISTINCT app.id_application 
FROM stat sj
LEFT OUTER JOIN groupe gp ON gp.id_groupe = sj.id_groupe 
LEFT OUTER JOIN application app ON app.id_application = gp.id_application 
WHERE date_stat >= '2016/04/01' 
AND date_stat <= '2016/04/30' 
AND ( date_stat = date_gen-1 or ( date_gen = '2016/05/01' AND date_job = '2016/04/30')) 
AND app.id_application IS NOT NULL 

此查询现在需要120秒。因此,我还在此查询上运行了EXPLAIN ANALYZE,现在它没有相同的操作:

This query takes now 120 seconds. So I also ran EXPLAIN ANALYZE on this query and now it doesn't have the same operations:

HashAggregate  (cost=375363.50..375363.51 rows=1 width=4) (actual time=186716.468..186716.532 rows=490 loops=1)
->  Nested Loop  (cost=0.00..375363.49 rows=1 width=4) (actual time=1.945..186619.404 rows=118990 loops=1)
    Join Filter: (gp.id_application = app.id_application)
    Rows Removed by Join Filter: 82222090
    ->  Nested Loop  (cost=0.00..375343.49 rows=1 width=4) (actual time=1.821..171458.237 rows=118990 loops=1)
          Join Filter: (sj.id_groupe = gp.id_groupe)
          Rows Removed by Join Filter: 954061820
          ->  Seq Scan on stat sj  (cost=0.00..375109.47 rows=1 width=8) (actual time=0.235..1964.423 rows=118990 loops=1)
                Filter: ((date_stat >= '2016-04-01'::date) AND (date_stat <= '2016-04-30'::date) AND ((date_stat = (date_gen - 1)) OR ((date_gen = '2016-05-01'::date) AND (date_stat = '2016-04-30'::date))))
                Rows Removed by Filter: 7343798
          ->  Seq Scan on groupe gp  (cost=0.00..133.45 rows=8045 width=12) (actual time=0.002..0.736 rows=8019 loops=118990)
    ->  Seq Scan on application app  (cost=0.00..11.67 rows=667 width=4) (actual time=0.003..0.073 rows=692 loops=118990)
          Filter: (id_application IS NOT NULL)
  Total runtime: 186716.635 ms

所以我决定通过减少条件数来搜索问题出处

So I decided to search where the problem came from by reducing the number of conditions from my query until the performances is acceptable again.

因此仅使用此参数

WHERE date_stat >= '2016/04/01'

仅需1.9秒(例如第一个工作查询)
,并且还可以使用2个参数:

It takes only 1.9secondes (like the first working query) and it's also working with 2 parameters :

WHERE date_stat >= '2016/04/01' 
AND app.id_application IS NOT NULL 

但是当我尝试在解释中添加其中的一行我有嵌套循环

BUT when I try to add one of those line I have the Nested loop in the Explain

AND date_stat <= '2016/04/30' 
AND ( date_stat = date_gen-1 or ( date_gen = '2016/05/01' AND date_stat = '2016/04/30')) 

有人不知道它可能来自哪里吗?

Does someone have any idea where it could come from?

推荐答案

好的,看来优化器估算存在问题。他认为4月份将只有 1行,因此他选择了 NESTED LOOP ,这对于大量行(在这种情况下为 118,990 )。

Ok, it looks like there's problem with optimizer estimations. He thiks that for april there will be only 1 row so he choose NESTED LOOP which is very inefficient for big number of rows (118,990 in that case).


  1. 执行每个表的VACUUM ANALYZE 。这将清除死元组并刷新统计信息。

  2. 考虑基于日期添加索引,例如 CREATE INDEX date_stat_idx开启<带有date_stat的表>使用btree(date_stat);

  1. Perform VACUUM ANALYZE for every table. This will clean up dead tuples and refresh statistics.
  2. consider adding index based on dates like CREATE INDEX date_stat_idx ON <table with date_stat> USING btree (date_stat);

重新运行查询,

这篇关于使用嵌套循环提高SQL查询的性能-PostgreSQL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆