使用 GROUP BY 与 DISTINCT 时的巨大性能差异 [英] Huge performance difference when using GROUP BY vs DISTINCT

查看:45
本文介绍了使用 GROUP BY 与 DISTINCT 时的巨大性能差异的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在一个 HSQLDB 服务器上执行一些测试,其中包含一个包含 500 000 个条目的表.该表没有索引.有 5000 个不同的业务键.我需要他们的清单.

I am performing some tests on a HSQLDB server with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them.

当然,我从一个 DISTINCT 查询开始:

Naturally I started with a DISTINCT query:

SELECT DISTINCT business_key
FROM memory
WHERE concept <> 'case'   OR 
      attrib  <> 'status' OR 
      value   <> 'closed';

大约需要 90 秒!!!

It takes around 90 seconds!!!

然后我尝试使用GROUP BY:

SELECT business_key
FROM memory
WHERE concept <> 'case'   OR 
      attrib  <> 'status' OR
      value   <> 'closed';
GROUP BY business_key

需要 1 秒!!!

试图找出差异我运行了 EXLAIN PLAN FOR 但它似乎为两个查询提供了相同的信息.

Trying to figure out the difference I ran EXLAIN PLAN FOR but it seems to give the same information for both queries.

EXLAIN Plan FOR DISTINCT ...

isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EXLAIN PLAN FOR SELECT ... GROUP BY ...

isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks


编辑

我做了额外的测试.HSQLDB 中有 500 000 条记录,包含所有不同的业务键,DISTINCT 的性能现在更好 - 3 秒,而 GROUP BY 花费了大约9 秒.


EDIT

I did additional tests. With 500 000 records in HSQLDB with all distinct business keys, the performance of DISTINCT is now better - 3 seconds, vs GROUP BY which took around 9 seconds.

MySQL 中,两个查询执行相同的:

In MySQL both queries preform the same:

MySQL:500 000 行 - 5 000 个不同的业务键:两个查询:0.5 秒MySQL:500 000 行 - 所有不同的业务键:SELECT DISTINCT ... - 11 秒SELECT ... GROUP BY business_key - 13 秒

MySQL: 500 000 rows - 5 000 distinct business keys: Both queries: 0.5 second MySQL: 500 000 rows - all distinct business keys: SELECT DISTINCT ... - 11 seconds SELECT ... GROUP BY business_key - 13 seconds

所以问题只与HSQLDB有关.

如果有人能解释为什么会有如此巨大的差异,我将不胜感激.

I will be very grateful if someone can explain why there is such a drastic difference.

推荐答案

这两个查询表达了相同的问题.显然查询优化器选择了两个不同的执行计划.我的猜测是 distinct 方法的执行方式如下:

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

  • 将所有 business_key 值复制到临时表中
  • 对临时表进行排序
  • 扫描临时表,返回与前一项不同的每一项

group by 可以像这样执行:

  • 扫描整个表,将业务键的每个值存储在一个哈希表中
  • 返回哈希表的键
  • Scan the full table, storing each value of business key in a hashtable
  • Return the keys of the hashtable

第一种方法针对内存使用进行了优化:当必须换出部分临时表时,它仍然会表现得相当好.第二种方法优化了速度,但如果有很多不同的键,则可能需要大量内存.

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

由于您要么有足够的内存,要么有几个不同的键,所以第二种方法优于第一种方法.在两个执行计划之间看到 10 倍甚至 100 倍的性能差异并不罕见.

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.

这篇关于使用 GROUP BY 与 DISTINCT 时的巨大性能差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆