Improve PostgreSQL aggregation query performance


Problem description


I am aggregating data from a Postgres table; the query takes approximately 2 seconds, and I want to reduce that to under a second.

Please find below the execution details:


Query

select
    a.search_keyword,
    hll_cardinality(hll_union_agg(a.users))::int as user_count,
    hll_cardinality(hll_union_agg(a.sessions))::int as session_count,
    sum(a.total) as keyword_count
from
    rollup_day a
where
    a.created_date between '2018-09-01' and '2019-09-30'
    and a.tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'
group by
    a.search_keyword
order by
    session_count desc
limit 100;
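
For context, hll_cardinality and hll_union_agg come from the postgresql-hll extension: the users and sessions columns hold HyperLogLog sketches that can be merged (hll_union_agg) and then counted approximately (hll_cardinality). A rough, assumed shape of the table, inferred only from the query and the plan below; the actual DDL may differ:

-- Assumed shape of rollup_day; requires the postgresql-hll extension
-- for the hll column type. Column types are inferred from the query.
CREATE EXTENSION IF NOT EXISTS hll;

CREATE TABLE rollup_day (
    tenant_id      uuid NOT NULL,
    created_date   date NOT NULL,
    search_keyword text NOT NULL,
    users          hll,      -- HyperLogLog sketch of distinct user ids
    sessions       hll,      -- HyperLogLog sketch of distinct session ids
    total          bigint    -- raw keyword count for that day
);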


Table metadata

  1. Total number of rows - 506527
  2. Composite index on the columns tenant_id and created_date (a sketch of its definition follows this list)
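
The composite index mentioned in item 2 would look roughly like this; only the table name and column order are taken from the question, the index name is illustrative:

-- Assumed definition of the existing composite index: equality column
-- (tenant_id) first, range column (created_date) second.
CREATE INDEX rollup_day_tenant_date_idx
    ON rollup_day (tenant_id, created_date);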


Query plan

Custom Scan (cost=0.00..0.00 rows=0 width=0) (actual time=1722.685..1722.694 rows=100 loops=1)
  Task Count: 1
  Tasks Shown: All
  ->  Task
        Node: host=localhost port=5454 dbname=postgres
        ->  Limit  (cost=64250.24..64250.49 rows=100 width=42) (actual time=1783.087..1783.106 rows=100 loops=1)
              ->  Sort  (cost=64250.24..64558.81 rows=123430 width=42) (actual time=1783.085..1783.093 rows=100 loops=1)
                    Sort Key: ((hll_cardinality(hll_union_agg(sessions)))::integer) DESC
                    Sort Method: top-N heapsort  Memory: 33kB
                    ->  GroupAggregate  (cost=52933.89..59532.83 rows=123430 width=42) (actual time=905.502..1724.363 rows=212633 loops=1)
                          Group Key: search_keyword
                          ->  Sort  (cost=52933.89..53636.53 rows=281055 width=54) (actual time=905.483..1351.212 rows=280981 loops=1)
                                Sort Key: search_keyword
                                Sort Method: external merge  Disk: 18496kB
                                ->  Seq Scan on rollup_day a  (cost=0.00..17890.22 rows=281055 width=54) (actual time=29.720..112.161 rows=280981 loops=1)
                                      Filter: ((created_date >= '2018-09-01'::date) AND (created_date <= '2019-09-30'::date) AND (tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'::uuid))
                                      Rows Removed by Filter: 225546
            Planning Time: 0.129 ms
            Execution Time: 1786.222 ms
Planning Time: 0.103 ms
Execution Time: 1722.718 ms


What I've tried

  1. I've tried an index on tenant_id and created_date, but because the data volume is large it always does a sequential scan rather than an index scan for these filters. I've read that the Postgres query planner switches to a sequential scan when the returned data exceeds roughly 5-10% of the total rows. Please follow the link for more reference.
  2. I've increased work_mem to 100MB, but it only improved performance a little (see the sketch after this list for how the setting can be changed).
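
For reference, work_mem can be raised either per session or persistently; a minimal sketch of both, using the 100MB value mentioned above:

-- Per session, affects only the current connection:
SET work_mem = '100MB';

-- Persistently for the whole server (requires a configuration reload):
ALTER SYSTEM SET work_mem = '100MB';
SELECT pg_reload_conf();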

Any help would be really appreciated.


Update

Query plan after setting work_mem to 100MB

Custom Scan (cost=0.00..0.00 rows=0 width=0) (actual time=1375.926..1375.935 rows=100 loops=1)
  Task Count: 1
  Tasks Shown: All
  ->  Task
        Node: host=localhost port=5454 dbname=postgres
        ->  Limit  (cost=48348.85..48349.10 rows=100 width=42) (actual time=1307.072..1307.093 rows=100 loops=1)
              ->  Sort  (cost=48348.85..48633.55 rows=113880 width=42) (actual time=1307.071..1307.080 rows=100 loops=1)
                    Sort Key: (sum(total)) DESC
                    Sort Method: top-N heapsort  Memory: 35kB
                    ->  GroupAggregate  (cost=38285.79..43996.44 rows=113880 width=42) (actual time=941.504..1261.177 rows=172945 loops=1)
                          Group Key: search_keyword
                          ->  Sort  (cost=38285.79..38858.52 rows=229092 width=54) (actual time=941.484..963.061 rows=227261 loops=1)
                                Sort Key: search_keyword
                                Sort Method: quicksort  Memory: 32982kB
                                ->  Seq Scan on rollup_day_104290 a  (cost=0.00..17890.22 rows=229092 width=54) (actual time=38.803..104.350 rows=227261 loops=1)
                                      Filter: ((created_date >= '2019-01-01'::date) AND (created_date <= '2019-12-30'::date) AND (tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'::uuid))
                                      Rows Removed by Filter: 279266
            Planning Time: 0.131 ms
            Execution Time: 1308.814 ms
Planning Time: 0.112 ms
Execution Time: 1375.961 ms


Update 2

After creating an index on created_date and increasing work_mem to 120MB:

create index date_idx on rollup_day(created_date);

The total number of rows is: 12,124,608

Query Plan is:

Custom Scan (cost=0.00..0.00 rows=0 width=0) (actual time=2635.530..2635.540 rows=100 loops=1)
  Task Count: 1
  Tasks Shown: All
  ->  Task
        Node: host=localhost port=9702 dbname=postgres
        ->  Limit  (cost=73545.19..73545.44 rows=100 width=51) (actual time=2755.849..2755.873 rows=100 loops=1)
              ->  Sort  (cost=73545.19..73911.25 rows=146424 width=51) (actual time=2755.847..2755.858 rows=100 loops=1)
                    Sort Key: (sum(total)) DESC
                    Sort Method: top-N heapsort  Memory: 35kB
                    ->  GroupAggregate  (cost=59173.97..67948.97 rows=146424 width=51) (actual time=2014.260..2670.732 rows=296537 loops=1)
                          Group Key: search_keyword
                          ->  Sort  (cost=59173.97..60196.85 rows=409152 width=55) (actual time=2013.885..2064.775 rows=410618 loops=1)
                                Sort Key: search_keyword
                                Sort Method: quicksort  Memory: 61381kB
                                ->  Index Scan using date_idx_102913 on rollup_day_102913 a  (cost=0.42..21036.35 rows=409152 width=55) (actual time=0.026..183.370 rows=410618 loops=1)
                                      Index Cond: ((created_date >= '2018-01-01'::date) AND (created_date <= '2018-12-31'::date))
                                      Filter: (tenant_id = '12850a62-19ac-477d-9cd7-837f3d716885'::uuid)
            Planning Time: 0.135 ms
            Execution Time: 2760.667 ms
Planning Time: 0.090 ms
Execution Time: 2635.568 ms

Solution

You should experiment with higher settings of work_mem until you get an in-memory sort. Of course you can only be generous with memory if your machine has enough of it.
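
A minimal way to run that experiment, assuming a session-level setting: raise work_mem, re-run the query under EXPLAIN (ANALYZE), and keep increasing it until the inner Sort Method reports quicksort (in memory) instead of external merge Disk. The 256MB value is only an illustrative starting point.

SET work_mem = '256MB';   -- illustrative value; adjust until the sort fits in memory

EXPLAIN (ANALYZE, BUFFERS)
select
    a.search_keyword,
    hll_cardinality(hll_union_agg(a.users))::int as user_count,
    hll_cardinality(hll_union_agg(a.sessions))::int as session_count,
    sum(a.total) as keyword_count
from rollup_day a
where a.created_date between '2018-09-01' and '2019-09-30'
  and a.tenant_id = '62850a62-19ac-477d-9cd7-837f3d716885'
group by a.search_keyword
order by session_count desc
limit 100;

RESET work_mem;           -- back to the server default for this session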

What would make your query way faster is if you store pre-aggregated data, either using a materialized view or a second table and a trigger on your original table that keeps the sums in the other table updated. I don't know if that is possible with your data, as I don't know what hll_cardinality and hll_union_agg are.
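
For what it's worth, hll_union_agg merges HyperLogLog sketches into another sketch, so the data can be pre-aggregated without losing the ability to combine it again later. A minimal sketch of such a materialized view, rolling the daily data up to months; all names here are illustrative and the refresh strategy is left open:

CREATE MATERIALIZED VIEW rollup_month AS
SELECT
    tenant_id,
    date_trunc('month', created_date)::date AS month_start,
    search_keyword,
    hll_union_agg(users)    AS users,      -- still an hll sketch, still mergeable
    hll_union_agg(sessions) AS sessions,
    sum(total)              AS total
FROM rollup_day
GROUP BY tenant_id, date_trunc('month', created_date)::date, search_keyword;

CREATE INDEX ON rollup_month (tenant_id, month_start);

-- Refresh periodically; alternatively maintain a plain summary table with
-- triggers on rollup_day, as described above.
REFRESH MATERIALIZED VIEW rollup_month;

The original query can then aggregate over rollup_month with month_start between the desired bounds and touch far fewer rows; since the rollup covers whole months, ranges that start or end mid-month would still need the edge days read from rollup_day.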
