PostgreSQL的GIN索引比pg_trgm的GIST慢吗? [英] PostgreSQL GIN index slower than GIST for pg_trgm?

查看:427
本文介绍了PostgreSQL的GIN索引比pg_trgm的GIST慢吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

尽管所有文档都说,但我发现pg_trgm相关搜索的GIN索引要比GIST索引慢得多。该表位于一个2500万行的表中,该表的文本字段相对较短(平均长度为21个字符)。文本的大多数行都是 123 Main st,City形式的地址。

Despite what all the documentation says, I'm finding GIN indexes to be significantly slower than GIST indexes for pg_trgm related searches. This is on a table of 25 million rows with a relatively short text field (average length of 21 characters). Most of the rows of text are addresses of the form "123 Main st, City".

GIST索引大约需要4秒钟的时间进行搜索,例如

GIST index takes about 4 seconds with a search like

select suggestion from search_suggestions where suggestion % 'seattle';

但是在运行 EXPLAIN ANALYZE <时,GIN需要90秒,并且以下结果

Bitmap Heap Scan on search_suggestions  (cost=330.09..73514.15 rows=25043 width=22) (actual time=671.606..86318.553 rows=40482 loops=1)
  Recheck Cond: ((suggestion)::text % 'seattle'::text)
  Rows Removed by Index Recheck: 23214341
  Heap Blocks: exact=7625 lossy=223807
  ->  Bitmap Index Scan on tri_suggestions_idx  (cost=0.00..323.83 rows=25043 width=0) (actual time=669.841..669.841 rows=1358175 loops=1)
        Index Cond: ((suggestion)::text % 'seattle'::text)
Planning time: 1.420 ms
Execution time: 86327.246 ms

请注意,即使实际上只有40k行匹配,索引也选择了超过一百万行。有什么想法为什么表现这么差吗?这是在PostgreSQL 9.4上。

Note that over a million rows are being selected by the index, even though only 40k rows actually match. Any ideas why this is performing so poorly? This is on PostgreSQL 9.4.

推荐答案

有些问题很突出:

首先,考虑升级到 Postgres的当前版本。在撰写本文时为9.6或pg 10(目前为beta)。自Pg 9.4起,GIN索引,一般的附加模块pg_trgm和大数据有了许多改进。

First, consider upgrading to a current version of Postgres. At the time of writing that's pg 9.6 or pg 10 (currently beta). Since Pg 9.4 there have been multiple improvements for GIN indexes, the additional module pg_trgm and big data in general.

下一步,您需要更多的 RAM >,尤其是更高的 work_mem 设置。我可以从 EXPLAIN 输出中的这一行看出来:

Next, you need much more RAM, in particular a higher work_mem setting. I can tell from this line in the EXPLAIN output:

Heap Blocks: exact=7625 lossy=223807

有损 的详细信息位图堆扫描(带有您的特定号码)表明 work_mem 严重短缺。 Postgres仅在位图索引扫描中收集块地址,而不是行指针,因为使用 work_mem 较低的设置(在RAM中无法保存确切地址),这有望更快。这样,必须在下面的 Bitmap Heap Scan 中过滤更多不合格的行。这个相关的答案有详细信息:

"lossy" in the details for a Bitmap Heap Scan (with your particular numbers) indicates a dramatic shortage of work_mem. Postgres only collects block addresses in the bitmap index scan instead of row pointers because that's expected to be faster with your low work_mem setting (can't hold exact addresses in RAM). Many more non-qualifying rows have to be filtered in the following Bitmap Heap Scan this way. This related answer has details:

  • "Recheck Cond:" line in query plans with a bitmap index scan

但不要设置 work_mem 太高而没有考虑整体情况:

But don't set work_mem too high without considering the whole situation:

  • Optimize simple query using ORDER BY date and text

还有其他问题,例如索引或表膨胀或更多配置瓶颈。但是,如果只修复这两个项目,查询应该已经更快了。

There may other problems, like index or table bloat or more configuration bottlenecks. But if you fix just these two items, the query should be much faster already.

此外,您真的需要检索所有4万行吗?在这个例子中?您可能想在查询中添加一个小的 LIMIT 并将其设为最近邻搜索-在这种情况下,GiST索引是毕竟是更好的选择,因为使用GiST索引, that 应该更快。示例:

Also, do you really need to retrieve all 40k rows in the example? You probably want to add a small LIMIT to the query and make it a "nearest-neighbor" search - in which case a GiST index is the better choice after all, because that is supposed to be faster with a GiST index. Example:

  • Best index for similarity function

这篇关于PostgreSQL的GIN索引比pg_trgm的GIST慢吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆