PostgreSQL的GIN索引比pg_trgm的GIST慢吗? [英] PostgreSQL GIN index slower than GIST for pg_trgm?
问题描述
尽管所有文档都说,但我发现pg_trgm相关搜索的GIN索引要比GIST索引慢得多。该表位于一个2500万行的表中,该表的文本字段相对较短(平均长度为21个字符)。文本的大多数行都是 123 Main st,City形式的地址。
Despite what all the documentation says, I'm finding GIN indexes to be significantly slower than GIST indexes for pg_trgm related searches. This is on a table of 25 million rows with a relatively short text field (average length of 21 characters). Most of the rows of text are addresses of the form "123 Main st, City".
GIST索引大约需要4秒钟的时间进行搜索,例如
GIST index takes about 4 seconds with a search like
select suggestion from search_suggestions where suggestion % 'seattle';
但是在运行 EXPLAIN ANALYZE <时,GIN需要90秒,并且以下结果
:
Bitmap Heap Scan on search_suggestions (cost=330.09..73514.15 rows=25043 width=22) (actual time=671.606..86318.553 rows=40482 loops=1)
Recheck Cond: ((suggestion)::text % 'seattle'::text)
Rows Removed by Index Recheck: 23214341
Heap Blocks: exact=7625 lossy=223807
-> Bitmap Index Scan on tri_suggestions_idx (cost=0.00..323.83 rows=25043 width=0) (actual time=669.841..669.841 rows=1358175 loops=1)
Index Cond: ((suggestion)::text % 'seattle'::text)
Planning time: 1.420 ms
Execution time: 86327.246 ms
请注意,即使实际上只有40k行匹配,索引也选择了超过一百万行。有什么想法为什么表现这么差吗?这是在PostgreSQL 9.4上。
Note that over a million rows are being selected by the index, even though only 40k rows actually match. Any ideas why this is performing so poorly? This is on PostgreSQL 9.4.
推荐答案
有些问题很突出:
首先,考虑升级到 Postgres的当前版本。在撰写本文时为9.6或pg 10(目前为beta)。自Pg 9.4起,GIN索引,一般的附加模块pg_trgm和大数据有了许多改进。
First, consider upgrading to a current version of Postgres. At the time of writing that's pg 9.6 or pg 10 (currently beta). Since Pg 9.4 there have been multiple improvements for GIN indexes, the additional module pg_trgm and big data in general.
下一步,您需要更多的 RAM >,尤其是更高的 work_mem
设置。我可以从 EXPLAIN
输出中的这一行看出来:
Next, you need much more RAM, in particular a higher work_mem
setting. I can tell from this line in the EXPLAIN
output:
Heap Blocks: exact=7625 lossy=223807
有损 的详细信息位图堆扫描(带有您的特定号码)表明 work_mem
严重短缺。 Postgres仅在位图索引扫描中收集块地址,而不是行指针,因为使用 work_mem
较低的设置(在RAM中无法保存确切地址),这有望更快。这样,必须在下面的 Bitmap Heap Scan 中过滤更多不合格的行。这个相关的答案有详细信息:
"lossy" in the details for a Bitmap Heap Scan (with your particular numbers) indicates a dramatic shortage of work_mem
. Postgres only collects block addresses in the bitmap index scan instead of row pointers because that's expected to be faster with your low work_mem
setting (can't hold exact addresses in RAM). Many more non-qualifying rows have to be filtered in the following Bitmap Heap Scan this way. This related answer has details:
- "Recheck Cond:" line in query plans with a bitmap index scan
但不要设置 work_mem
太太高而没有考虑整体情况:
But don't set work_mem
too high without considering the whole situation:
- Optimize simple query using ORDER BY date and text
还有其他问题,例如索引或表膨胀或更多配置瓶颈。但是,如果只修复这两个项目,查询应该已经更快了。
There may other problems, like index or table bloat or more configuration bottlenecks. But if you fix just these two items, the query should be much faster already.
此外,您真的需要检索所有4万行吗?在这个例子中?您可能想在查询中添加一个小的 LIMIT
并将其设为最近邻搜索-在这种情况下,GiST索引是毕竟是更好的选择,因为使用GiST索引, that 应该更快。示例:
Also, do you really need to retrieve all 40k rows in the example? You probably want to add a small LIMIT
to the query and make it a "nearest-neighbor" search - in which case a GiST index is the better choice after all, because that is supposed to be faster with a GiST index. Example:
- Best index for similarity function
这篇关于PostgreSQL的GIN索引比pg_trgm的GIST慢吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!