为什么以下联接会显着增加查询时间? [英] Why does the following join increase the query time significantly?
问题描述
我这里有一个星型模式,正在查询事实表,并希望加入一个非常小的尺寸表.我真的无法解释以下内容:
I have a star schema here and I am querying the fact table and would like to join one very small dimension table. I can't really explain the following:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id
FROM bi.impressions imp
GROUP BY imp.os_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=868719.08..868719.24 rows=16 width=10) (actual time=12559.462..12559.466 rows=26 loops=1)
-> Seq Scan on impressions imp (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.009..3030.093 rows=35682474 loops=1)
Total runtime: 12559.523 ms
(3 rows)
这大约需要12600毫秒,但是当然没有联接数据,因此我无法将imp.os_id解析为有意义的东西,因此我添加了联接:
This takes ~12600ms, but of course there is no joined data, so I can't "resolve" the imp.os_id to something meaningful, so I add a join:
EXPLAIN ANALYZE SELECT
COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp, bi.os_desc os
WHERE imp.os_id=os.os_id
GROUP BY imp.os_id, os.os_desc;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1448560.83..1448564.99 rows=416 width=22) (actual time=25565.124..25565.127 rows=26 loops=1)
-> Hash Join (cost=1.58..1180942.29 rows=35682472 width=22) (actual time=0.046..15157.684 rows=35682474 loops=1)
Hash Cond: (imp.os_id = os.os_id)
-> Seq Scan on impressions imp (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.007..3705.647 rows=35682474 loops=1)
-> Hash (cost=1.26..1.26 rows=26 width=14) (actual time=0.028..0.028 rows=26 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 2kB
-> Seq Scan on os_desc os (cost=0.00..1.26 rows=26 width=14) (actual time=0.003..0.010 rows=26 loops=1)
Total runtime: 25565.199 ms
(8 rows)
这实际上使查询的执行时间加倍.我的问题是,我从图片中遗漏了什么?我认为这么小的查找不会引起查询执行时间的巨大差异.
This effectively doubles the execution time of my query. My question is, what did I leave out from the picture? I would think such a small lookup was not causing huge difference in query execution time.
推荐答案
使用(推荐)显式ANSI JOIN语法重写:
Rewritten with (recommended) explicit ANSI JOIN syntax:
SELECT COUNT(impression_id), imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os ON os.os_id = imp.os_id
GROUP BY imp.os_id, os.os_desc;
首先,如果在展示的每一行中,在os_desc
中找到的匹配项多于或少于一个,则您的第二个查询可能是错误的.
如果在os_id
上具有外键约束以确保引用完整性,并且在bi.impressions.os_id
上具有 NOT NULL
约束,则可以排除此情况.如果是这样,第一步,请简化为:
First of all, your second query might be wrong, if more or less than exactly one match are found in os_desc
for every row in impressions.
This can be ruled out if you have a foreign key constraint on os_id
in place, that guarantees referential integrity, plus a NOT NULL
constraint on bi.impressions.os_id
. If so, in a first step, simplify to:
SELECT COUNT(*) AS ct, imp.os_id, os.os_desc
FROM bi.impressions imp
JOIN bi.os_desc os USING (os_id)
GROUP BY imp.os_id, os.os_desc;
count(*)
比count(column)
快,如果列为NOT NULL
,则在这里等效.并为计数添加列别名.
count(*)
is faster than count(column)
and equivalent here if the column is NOT NULL
. And add a column alias for the count.
更快,
SELECT os_id, os.os_desc, sub.ct
FROM (
SELECT os_id, COUNT(*) AS ct
FROM bi.impressions
GROUP BY 1
) sub
JOIN bi.os_desc os USING (os_id)
先聚集,然后加入.更多内容:
Aggregate first, join later. More here:
- Aggregate a single column in query with many columns
- PostgreSQL - order by an array
这篇关于为什么以下联接会显着增加查询时间?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!