Is there any way to speed up this Postgres bitmap heap scan?


Question


Database newbie here. This is my query, I'm using Postgres 9.3.5:

=# explain analyse SELECT SUM(actual_cost) as cost, SUM(total_items) 
   as num_items, processing_date FROM frontend_items 
   WHERE chemical_id='0501013B0' GROUP BY processing_date;

And this is the query plan:

HashAggregate  (cost=1648624.91..1648624.92 rows=1 width=16) (actual time=12591.844..12591.848 rows=17 loops=1)
   ->  Bitmap Heap Scan on frontend_items  (cost=14520.24..1643821.35 rows=640474 width=16) (actual time=254.841..12317.746 rows=724242 loops=1)
         Recheck Cond: ((chemical_id)::text = '0501013B0'::text)
         ->  Bitmap Index Scan on frontend_items_chemical_id_varchar_pattern_ops_idx  (cost=0.00..14360.12 rows=640474 width=0) (actual time=209.538..209.538 rows=724242 loops=1)
               Index Cond: ((chemical_id)::text = '0501013B0'::text)
 Total runtime: 12592.499 ms

As you can see, it's the Bitmap Heap Scan that takes up most of the time. Is there any way to speed this up?

I can create more indexes if needed: my data is almost read-only (it updates once a month).
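Given the almost-read-only workload, two index-side sketches that are sometimes suggested for this query shape (filter on `chemical_id`, group by `processing_date`). The composite index name is hypothetical, and neither step is guaranteed to help here; they are untested sketches, not a recommendation:

```sql
-- A composite index matching the query's filter and grouping columns
-- (hypothetical name):
CREATE INDEX frontend_items_chemical_date_idx
    ON frontend_items (chemical_id, processing_date);

-- Since the table is rewritten only monthly, physically ordering the heap
-- by chemical_id would make the bitmap heap scan's page reads sequential.
-- This uses the existing chemical_id index from the schema above:
CLUSTER frontend_items USING frontend_items_a69d813a;
ANALYZE frontend_items;
```

`CLUSTER` takes an exclusive lock and rewrites the table, so it fits a monthly-update schedule but not a live write workload.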

I'm guessing there isn't much I can do, given that I want multiple attributes, except pay for enough RAM to hold the entire database in memory, but suggestions would be very much appreciated.

It's possible I could just look up one of these attributes at a time, if that would speed things up.

NB: I'm running this on a Macbook with 16GB of RAM, and an SSD. I have set shared_buffers to 4GB and work_mem to 40MB. I will eventually be using a server with 32GB of RAM and an SSD.

UPDATE: The table schema is as follows:

     Column       |          Type           |                             Modifiers
-------------------+-------------------------+--------------------------------------------------------------------
 id                | integer                 | not null default nextval('frontend_items_id_seq'::regclass)
 presentation_code | character varying(15)   | not null
 presentation_name | character varying(1000) | not null
 total_items       | integer                 | not null
 net_cost          | double precision        | not null
 actual_cost       | double precision        | not null
 quantity          | double precision        | not null
 processing_date   | date                    | not null
 price_per_unit    | double precision        | not null
 chemical_id       | character varying(9)    | not null
 pct_id            | character varying(3)    | not null
 practice_id       | character varying(6)    | not null
 sha_id            | character varying(3)    | not null
Indexes:
    "frontend_items_pkey" PRIMARY KEY, btree (id)
    "frontend_items_45fff4c7" btree (sha_id)
    "frontend_items_4e2e609b" btree (pct_id)
    "frontend_items_528f368c" btree (processing_date)
    "frontend_items_6ea07fe3" btree (practice_id)
    "frontend_items_a69d813a" btree (chemical_id)
    "frontend_items_b9b2c7ab" btree (presentation_code)
    "frontend_items_chemical_id_varchar_pattern_ops_idx" btree (chemical_id varchar_pattern_ops)
    "frontend_items_pct_code_id_488a8bbfb2bddc6d_like" btree (pct_id varchar_pattern_ops)
    "frontend_items_practice_id_bbbafffdb2c2bf1_like" btree (practice_id varchar_pattern_ops)
    "frontend_items_presentation_code_69403ee04fda6522_like" btree (presentation_code varchar_pattern_ops)
    "frontend_items_presentation_code_varchar_pattern_ops_idx" btree (presentation_code varchar_pattern_ops)
Foreign-key constraints:
    "front_chemical_id_4619f68f65c49a8_fk_frontend_chemical_bnf_code" FOREIGN KEY (chemical_id) REFERENCES frontend_chemical(bnf_code) DEFERRABLE INITIALLY DEFERRED
    "frontend__practice_id_bbbafffdb2c2bf1_fk_frontend_practice_code" FOREIGN KEY (practice_id) REFERENCES frontend_practice(code) DEFERRABLE INITIALLY DEFERRED
    "frontend_items_pct_id_30c06df242c3d1ba_fk_frontend_pct_code" FOREIGN KEY (pct_id) REFERENCES frontend_pct(code) DEFERRABLE INITIALLY DEFERRED
    "frontend_items_sha_id_4fa0ca3c3b9b67f_fk_frontend_sha_code" FOREIGN KEY (sha_id) REFERENCES frontend_sha(code) DEFERRABLE INITIALLY DEFERRED

And here's the output of a verbose explain:

# explain (verbose, buffers, analyse) SELECT SUM(actual_cost) as cost, SUM(total_items) as num_items, processing_date FROM frontend_items WHERE chemical_id='0501012G0' GROUP BY processing_date;



    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1415349.73..1415349.74 rows=1 width=16) (actual time=3048.551..3048.556 rows=17 loops=1)
   Output: sum(actual_cost), sum(total_items), processing_date
   Buffers: shared hit=141958 read=12725
   ->  Bitmap Heap Scan on public.frontend_items  (cost=11797.55..1411446.84 rows=520385 width=16) (actual time=213.889..2834.911 rows=524644 loops=1)
         Output: id, presentation_code, presentation_name, total_items, net_cost, actual_cost, quantity, processing_date, price_per_unit, chemical_id, pct_id, practice_id, sha_id
         Recheck Cond: ((frontend_items.chemical_id)::text = '0501012G0'::text)
         Buffers: shared hit=141958 read=12725
         ->  Bitmap Index Scan on frontend_items_chemical_id_varchar_pattern_ops_idx  (cost=0.00..11667.46 rows=520385 width=0) (actual time=172.574..172.574 rows=524644 loops=1)
               Index Cond: ((frontend_items.chemical_id)::text = '0501012G0'::text)
               Buffers: shared hit=2 read=2012
 Total runtime: 3049.177 ms

Solution

You have 724242 rows and the query takes 12592.499 ms. That is 0.017387 ms per row, i.e. 57514 rows per second. What are you complaining about? I think your query is plenty fast. Ordinary HDDs sustain only about 65-200 rows per second through an index scan, although a bitmap index/heap scan is faster. I think you'll find that PostgreSQL is using the best possible query plan for your situation.
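The per-row arithmetic above can be checked directly (figures taken from the EXPLAIN output in the question):

```python
# Throughput figures from the first query plan: 724242 rows in 12592.499 ms.
rows = 724242
runtime_ms = 12592.499

ms_per_row = runtime_ms / rows              # milliseconds spent per matched row
rows_per_second = rows / (runtime_ms / 1000)

print(f"{ms_per_row:.6f} ms/row")           # ≈ 0.017387
print(f"{rows_per_second:.0f} rows/s")      # ≈ 57514
```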

If you execute the query again, does it get faster? The caches would be hot then, so a repeated execution might be faster. If it doesn't get faster, then it's unlikely that more memory would help. The data page size of PostgreSQL is 8 KB, so you're accessing at most 724242 × 8 KB ≈ 5.5 GB of data, i.e. the data should fit into your RAM.
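The 5.5 GB upper bound comes from assuming the worst case of one 8 KB heap page fetched per matched row:

```python
# Worst case: every matched row lives on its own 8 KB heap page.
rows = 724242
page_bytes = 8 * 1024

max_bytes = rows * page_bytes
print(f"{max_bytes / 2**30:.1f} GiB")       # ≈ 5.5
```

In practice many rows share a page (the Buffers lines above show ~155k pages, not 724k), so the real working set is smaller still.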

Edit: the second query, added in the edited version of the question, shows a performance of about 172000 rows per second. So such queries can indeed become faster once the data is cached in RAM. I would choose the approach of fitting the entire dataset in RAM. RAM is cheap, but developer time is expensive.

