使用索引扫描,PostgreSQL查询运行得更快,但引擎选择散列连接 [英] PostgreSQL query runs faster with index scan, but engine chooses hash join
问题描述
查询:
SELECT "replays_game".*
FROM "replays_game"
INNER JOIN
"replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 50027
如果我设置 SET enable_seqscan = off
,那么它做的很快,那就是:
If I set SET enable_seqscan = off
, then it does the fast thing, which is:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..27349.80 rows=3395 width=72) (actual time=28.726..65.056 rows=3398 loops=1)
-> Index Scan using replays_playeringame_player_id on replays_playeringame (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.019..2.412 rows=3398 loops=1)
Index Cond: (player_id = 50027)
-> Index Scan using replays_game_pkey on replays_game (cost=0.00..5.41 rows=1 width=72) (actual time=0.017..0.017 rows=1 loops=3398)
Index Cond: (id = replays_playeringame.game_id)
Total runtime: 65.437 ms
但是如果没有可怕的enable_seqscan,它会选择做一个较慢的事情:
But without the dreaded enable_seqscan, it chooses to do a slower thing:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=7330.18..18145.24 rows=3395 width=72) (actual time=92.380..535.422 rows=3398 loops=1)
Hash Cond: (replays_playeringame.game_id = replays_game.id)
-> Index Scan using replays_playeringame_player_id on replays_playeringame (cost=0.00..8934.43 rows=3395 width=4) (actual time=0.020..2.899 rows=3398 loops=1)
Index Cond: (player_id = 50027)
-> Hash (cost=3668.08..3668.08 rows=151208 width=72) (actual time=90.842..90.842 rows=151208 loops=1)
Buckets: 1024 Batches: 32 (originally 16) Memory Usage: 1025kB
-> Seq Scan on replays_game (cost=0.00..3668.08 rows=151208 width=72) (actual time=0.020..29.061 rows=151208 loops=1)
Total runtime: 535.821 ms
以下是相关指数:
Index "public.replays_game_pkey"
Column | Type | Definition
--------+---------+------------
id | integer | id
primary key, btree, for table "public.replays_game"
Index "public.replays_playeringame_player_id"
Column | Type | Definition
-----------+---------+------------
player_id | integer | player_id
btree, for table "public.replays_playeringame"
所以我的问题是,什么是我错了,Postgres错误地估计了两种加入方式的相对成本?我在成本估算中看到认为哈希联接会更快。它对指数加入成本的估计减少了500倍。
So my question is, what am I doing wrong that Postgres is mis-estimating the relative costs of the two ways of joining? I see in the cost estimates that it thinks the hash-join will be faster. And its estimate of the cost of the index-join is off by a factor of 500.
我如何给Postgres更多线索?在运行上述所有操作之前,我确实运行了 VACUUM ANALYZE
。
How can I give Postgres more of a clue? I did run a VACUUM ANALYZE
immediately before running all of the above.
有趣的是,如果我运行此查询Postgres选择进行索引扫描+嵌套循环。所以关于大量游戏的一些事情会发现这种不受欢迎的行为,其中相对估计的成本与实际估计的成本不符。
Interestingly, if I run this query for a player with a smaller # of games, Postgres chooses to do the index-scan + nested-loop. So something about the large # of games tickles this undesired behavior where relative estimated cost is out of line with actual estimated cost.
最后,我应该使用Postgres吗?我不希望成为数据库调优方面的专家,因此我正在寻找一个能够在良好的开发人员注意力水平上表现出色的数据库,而不是专用的DBA。我担心,如果我坚持使用Postgres,我将会有一系列问题,这将迫使我成为Postgres专家,也许另一个DB会更宽容一种更随意的方法。
Finally, should I be using Postgres at all? I don't wish to become an expert in database tuning, so I'm looking for a database that will perform reasonably well with a conscientious developer's level of attention, as opposed to a dedicated DBA. I am afraid that if I stick with Postgres I will have a steady stream of issues like this that will force me to become a Postgres expert, and perhaps another DB will be more forgiving of a more casual approach.
Postgres专家(RhodiumToad)审查了我的完整数据库设置( http://pastebin.com/77QuiQSp )并建议设置cpu_tuple_cost = 0.1
。这给了一个戏剧性的加速: http://pastebin.com/nTHvSHVd
A Postgres expert (RhodiumToad) reviewed my full database settings (http://pastebin.com/77QuiQSp) and recommended set cpu_tuple_cost = 0.1
. That gave a dramatic speedup: http://pastebin.com/nTHvSHVd
或者,切换到MySQL也很好地解决了这个问题。我在我的OS X盒子上默认安装了MySQL和Postgres,MySQL的速度提高了2倍,比较了通过反复执行查询而预热的查询。在冷查询中,即第一次执行给定查询时,MySQL的速度要快5到150倍。冷查询的性能对于我的特定应用程序非常重要。
Alternatively, switching to MySQL also solved the problem pretty nicely. I have a default installation of MySQL and Postgres on my OS X box, and MySQL is 2x faster, comparing queries that are "warmed up" by repeatedly executing the query. On "cold" queries, i.e. the first time a given query is executed, MySQL is 5 to 150 times faster. The performance of cold queries is pretty important for my particular application.
就我而言,最重要的问题仍然很突出 - Postgres需要更多的摆弄和配置比MySQL运行良好?例如,请考虑评论者提供的建议都不起作用。
The big question, as far as I'm concerned, is still outstanding -- will Postgres require more fiddling and configuration to run well than MySQL? For example, consider that none of the suggestions offered by the commenters here worked.
推荐答案
我的猜测是你使用的是默认值 random_page_cost = 4
,这太高了,使索引扫描成本太高。
My guess is that you are using the default random_page_cost = 4
, which is way too high, making index scan too costly.
我尝试重建2带有此脚本的表:
I try to reconstruct the 2 tables with this script:
CREATE TABLE replays_game (
id integer NOT NULL,
PRIMARY KEY (id)
);
CREATE TABLE replays_playeringame (
player_id integer NOT NULL,
game_id integer NOT NULL,
PRIMARY KEY (player_id, game_id),
CONSTRAINT replays_playeringame_game_fkey
FOREIGN KEY (game_id) REFERENCES replays_game (id)
);
CREATE INDEX ix_replays_playeringame_game_id
ON replays_playeringame (game_id);
-- 150k games
INSERT INTO replays_game
SELECT generate_series(1, 150000);
-- ~150k players, ~2 games each
INSERT INTO replays_playeringame
select trunc(random() * 149999 + 1), generate_series(1, 150000);
INSERT INTO replays_playeringame
SELECT *
FROM
(
SELECT
trunc(random() * 149999 + 1) as player_id,
generate_series(1, 150000) as game_id
) AS t
WHERE
NOT EXISTS (
SELECT 1
FROM replays_playeringame
WHERE
t.player_id = replays_playeringame.player_id
AND t.game_id = replays_playeringame.game_id
)
;
-- the heavy player with 3000 games
INSERT INTO replays_playeringame
select 999999, generate_series(1, 3000);
默认值为4:
game=# set random_page_cost = 4;
SET
game=# explain analyse SELECT "replays_game".*
FROM "replays_game"
INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 999999;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1483.54..4802.54 rows=3000 width=4) (actual time=3.640..110.212 rows=3000 loops=1)
Hash Cond: (replays_game.id = replays_playeringame.game_id)
-> Seq Scan on replays_game (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.012..34.261 rows=150000 loops=1)
-> Hash (cost=1446.04..1446.04 rows=3000 width=4) (actual time=3.598..3.598 rows=3000 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 106kB
-> Bitmap Heap Scan on replays_playeringame (cost=67.54..1446.04 rows=3000 width=4) (actual time=0.586..2.041 rows=3000 loops=1)
Recheck Cond: (player_id = 999999)
-> Bitmap Index Scan on replays_playeringame_pkey (cost=0.00..66.79 rows=3000 width=0) (actual time=0.560..0.560 rows=3000 loops=1)
Index Cond: (player_id = 999999)
Total runtime: 110.621 ms
将其降低到2:
game=# set random_page_cost = 2;
SET
game=# explain analyse SELECT "replays_game".*
FROM "replays_game"
INNER JOIN "replays_playeringame" ON "replays_game"."id" = "replays_playeringame"."game_id"
WHERE "replays_playeringame"."player_id" = 999999;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=45.52..4444.86 rows=3000 width=4) (actual time=0.418..27.741 rows=3000 loops=1)
-> Bitmap Heap Scan on replays_playeringame (cost=45.52..1424.02 rows=3000 width=4) (actual time=0.406..1.502 rows=3000 loops=1)
Recheck Cond: (player_id = 999999)
-> Bitmap Index Scan on replays_playeringame_pkey (cost=0.00..44.77 rows=3000 width=0) (actual time=0.388..0.388 rows=3000 loops=1)
Index Cond: (player_id = 999999)
-> Index Scan using replays_game_pkey on replays_game (cost=0.00..0.99 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=3000)
Index Cond: (id = replays_playeringame.game_id)
Total runtime: 28.542 ms
(8 rows)
如果使用SSD,我将它进一步降低到1.1。
If using SSD, I would lower it further to 1.1.
关于你的上一个问题,我认为你应该坚持使用postgresql。我有使用postgresql和mssql的经验,我需要将后续工作投入三倍,以使其执行一半以及前者。
As for your last question, I really think you should stick with postgresql. I have experience with postgresql and mssql, and I need to put in triple the effort into the later for it to perform half as well as the former.
这篇关于使用索引扫描,PostgreSQL查询运行得更快,但引擎选择散列连接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!