使用group vs vs distinct的巨大性能差异 [英] Huge performance difference when using group by vs distinct
问题描述
我正在一个包含500 000个条目的表上对 HSQLDB
服务器执行一些测试。该表没有索引。有5000个不同的业务密钥。我需要他们的清单。当然,我从 DISTINCT
查询开始:
SELECT DISTINCT business_key FROM memory WHERE
概念<> 'case'或
attrib<> '状态'或
值<> '关闭'
大约需要90秒!!!
然后我尝试使用 GROUP BY
:
SELECT business_key FROM内存WHERE
概念<> 'case'或
attrib<> '状态'或
值<> 'closed'
GROUP BY business_key
它需要1秒!!!
试图找出差异,我运行了 EXLAIN PLAN FOR
,但它似乎给出了两个查询的相同信息。 p>
EXLAIN PLAN FOR DISTINCT ...
isAggregated = [false]
列= [
COLUMN:PUBLIC.MEMORY.BUSINESS_KEY
]
[范围变量1
连接类型= INNER
table = MEMORY
alias = M
access = FULL SCAN
condition = [index = SYS_IDX_SYS_PK_10057_10058
other condition = [
OR arg_left = [
OR arg_left = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.CONCEPT] arg_right = [
VALUE = case,TYPE = CHARACTER]] arg_right = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.ATTRIB] arg_right = [
VALUE = status,TYPE = CHARACTER]]] arg_right = [
NOT_EQUAL arg_left = [
COLUMN: PUBL IC.MEMORY.VALUE] arg_right = [
VALUE = closed,TYPE = CHARACTER]]]
]
]]
PARAMETERS = []
SUBQUERIES []
对象引用
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
读锁
PUBLIC.MEMORY
WriteLocks
EXLAIN PLAN FOR SELECT ... GROUP BY ...
isDistinctSelect = [false]
isGrouped = [true]
isAggregated = [false]
列= [
COLUMN:PUBLIC.MEMORY.BUSINESS_KEY
]
[范围变量1
join type = INNER
table = MEMORY
alias = M
access = FULL SCAN
condition = [index = SYS_IDX_SYS_PK_10057_10058
other condition = [
OR arg_left = [
OR arg_left = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.CONCEPT] arg_right = [
VALUE = case,TYPE = CHARACTER]] arg_right = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.ATTRIB] arg_right = [
VALUE = status,TYPE = CHARACTER]]] arg_right = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.VALUE] arg_right = [
VALUE =已关闭,TYPE = CHARACTER]]]
]
]]
groupColumns = [
COLUMN:PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS = []
SUBQUERIES []
对象引用
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
读取锁
PUBLIC.MEMORY
WriteLocks
编辑:
我做了额外的测试。在 HSQLDB
中具有所有不同业务密钥的500 000条记录, DISTINCT
的性能现在更好 - 3秒,vs GROUP BY
耗时约9秒。
在 MySQL
MySQL:500 000行 - 5000个不同的业务密钥:
两个查询:0.5秒
MySQL:500 000行 - 所有不同的业务键:
SELECT DISTINCT ...
- 11秒
SELECT ... GROUP BY business_key - 13秒
所以这个问题只与 HSQLDB
有关。
如果有人能解释为什么会有这么大的差异,我将非常感激。
这两个查询表达了相同的问题。显然,查询优化器选择两个不同的执行计划。我的猜测是, distinct
方法执行如下:
business_key
值添加到临时表中
执行如下:
业务密钥
在散列表中
第一种方法优化内存使用:当临时表的一部分必须被换出时,它仍然会表现得相当好。第二种方法优化速度,但如果有很多不同的键,可能需要大量的内存。
由于您拥有足够的内存或几个不同的键,第二种方法胜过第一种。在两个执行计划之间看到10倍甚至100倍的性能差异并不罕见。
I am performing some tests on a HSQLDB
server with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them. Naturally I started with a DISTINCT
query:
SELECT DISTINCT business_key FROM memory WHERE
concept <> 'case' or
attrib <> 'status' or
value <> 'closed'
It takes around 90 seconds!!!
Then I tried using GROUP BY
:
SELECT business_key FROM memory WHERE
concept <> 'case' or
attrib <> 'status' or
value <> 'closed'
GROUP BY business_key
And it takes 1 second!!!
Trying to figure out the difference I ran EXLAIN PLAN FOR
but it seems to give the same information for both queries.
EXLAIN PLAN FOR DISTINCT ...
isAggregated=[false]
columns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
join type=INNER
table=MEMORY
alias=M
access=FULL SCAN
condition = [ index=SYS_IDX_SYS_PK_10057_10058
other condition=[
OR arg_left=[
OR arg_left=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
VALUE = case, TYPE = CHARACTER]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
VALUE = status, TYPE = CHARACTER]]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
VALUE = closed, TYPE = CHARACTER]]]
]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks
EXLAIN PLAN FOR SELECT ... GROUP BY ...
isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
join type=INNER
table=MEMORY
alias=M
access=FULL SCAN
condition = [ index=SYS_IDX_SYS_PK_10057_10058
other condition=[
OR arg_left=[
OR arg_left=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
VALUE = case, TYPE = CHARACTER]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
VALUE = status, TYPE = CHARACTER]]] arg_right=[
NOT_EQUAL arg_left=[
COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
VALUE = closed, TYPE = CHARACTER]]]
]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks
EDIT:
I did additional tests. With 500 000 records in HSQLDB
with all distinct business keys, the performance of DISTINCT
is now better - 3 seconds, vs GROUP BY
which took around 9 seconds.
In MySQL
both queries preform the same:
MySQL: 500 000 rows - 5 000 distinct business keys:
Both queries: 0.5 second
MySQL: 500 000 rows - all distinct business keys:
SELECT DISTINCT ...
- 11 seconds
SELECT ... GROUP BY business_key
- 13 seconds
So the problem is only related to HSQLDB
.
I will be very grateful if someone can explain why there is such a drastic difference.
The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct
approach is executed like:
- Copy all
business_key
values to a temporary table - Sort the temporary table
- Scan the temporary table, returning each item that is different from the one before it
The group by
could be executed like:
- Scan the full table, storing each value of
business key
in a hashtable - Return the keys of the hashtable
The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.
Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.
这篇关于使用group vs vs distinct的巨大性能差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!