xpath表达式上的Postgresql索引没有加速 [英] Postgresql index on xpath expression gives no speed up

查看:110
本文介绍了xpath表达式上的Postgresql索引没有加速的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们正在尝试在Postgresql中创建OEBS模拟功能。假设我们有一个表单构造函数,需要将表单结果存储在数据库中(例如电子邮件正文)。在Oracle中,您可以使用具有150个列的表(以及存储在其他位置的某些映射)来将每个字段存储在单独的列中。但与Oracle相反,我们希望将所有表单存储在postgresql xml字段中。
树的例子是

We are trying to create OEBS-analog functionality in Postgresql. Let's say we have a form constructor and need to store form results in database (e.g. email bodies). In Oracle you could use a table with 150~ columns (and some mapping stored elsewhere) to store each field in separate column. But in contrast to Oracle we would like to store all the form in postgresql xml field. The example of the tree is

 <row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">  
   <object_id>2</object_id>  
   <pack_form_id>23</pack_form_id>  
   <prod_form_id>34</prod_form_id>
 </row>

我们想搜索这个字段。
测试表包含400k行,以下选择在90秒内执行:

We would like to search through this field. Test table contains 400k rows and the following select executes in 90 seconds:

select * 
from params 
where (xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int=34

所以我创建了这个索引:

So I created this index:

create index prod_form_idx 
ON params using btree(
   ((xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int)
);

并没有任何区别。仍然执行90秒。 EXPLAIN计划显示:

And it made no difference. Still 90 seconds execution. EXPLAIN plan show this:

Bitmap Heap Scan on params  (cost=40.29..6366.44 rows=2063 width=292)
  Recheck Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)
  ->  Bitmap Index Scan on prod_form_idx  (cost=0.00..39.78 rows=2063 width=0)
        Index Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)

我不是伟大的计划解释器所以我想这意味着正在使用索引。问题是:速度在哪里?为了优化这种查询,我该怎么做?

I am not the great plan interpreter so I suppose this means that index is being used. The question is: where's all the speed? And what can i do in order to optimize this kind of queries?

推荐答案

好吧,至少使用索引。您获得了位图索引扫描而不是普通的索引扫描,这意味着xpath()函数将被多次调用。

Well, at least the index is used. You get a bitmap index scan instead of a normal index scan though, which means the xpath() function will be called lots of times.

让我们做一点检查:

CREATE TABLE foo ( id serial primary key, x xml, h hstore );
insert into foo (x,h) select XMLPARSE( CONTENT '<row  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">  
   <object_id>2</object_id>  
   <pack_form_id>' || n || '</pack_form_id>  
   <prod_form_id>34</prod_form_id>
 </row>' ), 
('object_id=>2,prod_form_id=>34,pack_form_id=>'||n)::hstore 
FROM generate_series( 1,100000 ) n;

test=> EXPLAIN ANALYZE SELECT count(*) FROM foo;
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4821.00..4821.01 rows=1 width=0) (actual time=24.694..24.694 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..4571.00 rows=100000 width=0) (actual time=0.006..13.996 rows=100000 loops=1)
 Total runtime: 24.730 ms

test=> explain analyze select * from foo where (h->'pack_form_id')='123';
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..5571.00 rows=500 width=68) (actual time=0.075..48.763 rows=1 loops=1)
   Filter: ((h -> 'pack_form_id'::text) = '123'::text)
 Total runtime: 36.808 ms

test=> explain analyze select * from foo where ((xpath('//pack_form_id/text()'::text, x))[1]::text) = '123';
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..5071.00 rows=500 width=68) (actual time=4.271..3368.838 rows=1 loops=1)
   Filter: (((xpath('//pack_form_id/text()'::text, x, '{}'::text[]))[1])::text = '123'::text)
 Total runtime: 3368.865 ms

我们可以看到,


  • 用count(*)扫描整个表需要25 ms

  • 从hstore中提取一个键/值会增加一个小的额外成本,大约0.12μs/行

  • 使用xpath从xml中提取一个键/值会增加巨大的成本,大约33μs/行

结论:


  • xml很慢(但每个人都知道)

  • 如果要在列中放置灵活的键/值存储,请使用hstore

此外,由于您的xml数据非常大,它将被烘烤(压缩并存储在主表之外)。这使得主表中的行更小,因此每页更多行,这降低了位图扫描的效率,因为页面上的所有行都必须重新检查。

Also since your xml data is pretty big it will be toasted (compressed and stored out of the main table). This makes the rows in the main table much smaller, hence more rows per page, which reduces the efficiency of bitmap scans since all rows on a page have to be rechecked.

你可以解决这个问题。由于某种原因,xpath()函数(非常慢,因为它处理xml)具有与所述相同的成本(1个单位),整数运算符+...

You can fix this though. For some reason the xpath() function (which is very slow, since it handles xml) has the same cost (1 unit) as say, the integer operator "+"...

update pg_proc set procost=1000 where proname='xpath';

您可能需要调整成本值。当给出正确的信息时,规划器知道xpath很慢并且将使用索引扫描来避免位图索引扫描,而不需要重新检查页面上所有行的条件。

You may need to tweak the cost value. When given the right info, the planner knows xpath is slow and will avoid a bitmap index scan, using an index scan instead, which doesn't need rechecking the condition for all rows on a page.

请注意,这并不能解决行估计问题。由于您无法分析xml(或hstore)的内部,因此您将获得行数的默认估计值(此处为500)。因此,如果涉及某些联接,计划者可能完全错误并选择灾难性计划。唯一的解决方案是使用正确的列。

Note that this does not solve the row estimates problem. Since you can't ANALYZE the inside of the xml (or hstore) you get default estimates for the number of rows (here, 500). So, the planner may be completely wrong and choose a catastrophic plan if some joins are involved. The only solution to this is to use proper columns.

这篇关于xpath表达式上的Postgresql索引没有加速的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆