使用group vs vs distinct的巨大性能差异 [英] Huge performance difference when using group by vs distinct

查看:107
本文介绍了使用group vs vs distinct的巨大性能差异的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在一个包含500 000个条目的表上对 HSQLDB 服务器执行一些测试。该表没有索引。有5000个不同的业务密钥。我需要他们的清单。当然,我从 DISTINCT 查询开始:

  SELECT DISTINCT business_key FROM memory WHERE 
概念<> 'case'或
attrib<> '状态'或
值<> '关闭'

大约需要90秒!!!

然后我尝试使用 GROUP BY

  SELECT business_key FROM内存WHERE 
概念<> 'case'或
attrib<> '状态'或
值<> 'closed'
GROUP BY business_key

它需要1秒!!!



试图找出差异,我运行了 EXLAIN PLAN FOR ,但它似乎给出了两个查询的相同信息。 p>

EXLAIN PLAN FOR DISTINCT ...

  isAggregated = [false] 
列= [
COLUMN:PUBLIC.MEMORY.BUSINESS_KEY
]
[范围变量1
连接类型= INNER
table = MEMORY
alias = M
access = FULL SCAN
condition = [index = SYS_IDX_SYS_PK_10057_10058
other condition = [
OR arg_left = [
OR arg_left = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.CONCEPT] arg_right = [
VALUE = case,TYPE = CHARACTER]] arg_right = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.ATTRIB] arg_right = [
VALUE = status,TYPE = CHARACTER]]] arg_right = [
NOT_EQUAL arg_left = [
COLUMN: PUBL IC.MEMORY.VALUE] arg_right = [
VALUE = closed,TYPE = CHARACTER]]]
]
]]
PARAMETERS = []
SUBQUERIES []
对象引用
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
读锁
PUBLIC.MEMORY
WriteLocks

EXLAIN PLAN FOR SELECT ... GROUP BY ...

  isDistinctSelect = [false] 
isGrouped = [true]
isAggregated = [false]
列= [
COLUMN:PUBLIC.MEMORY.BUSINESS_KEY
]
[范围变量1
join type = INNER
table = MEMORY
alias = M
access = FULL SCAN
condition = [index = SYS_IDX_SYS_PK_10057_10058
other condition = [
OR arg_left = [
OR arg_left = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.CONCEPT] arg_right = [
VALUE = case,TYPE = CHARACTER]] arg_right = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.ATTRIB] arg_right = [
VALUE = status,TYPE = CHARACTER]]] arg_right = [
NOT_EQUAL arg_left = [
COLUMN:PUBLIC.MEMORY.VALUE] arg_right = [
VALUE =已关闭,TYPE = CHARACTER]]]
]
]]
groupColumns = [
COLUMN:PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS = []
SUBQUERIES []
对象引用
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
读取锁
PUBLIC.MEMORY
WriteLocks

编辑
我做了额外的测试。在 HSQLDB 中具有所有不同业务密钥的500 000条记录, DISTINCT 的性能现在更好 - 3秒,vs GROUP BY 耗时约9秒。



MySQL

MySQL:500 000行 - 5000个不同的业务密钥:
两个查询:0.5秒
MySQL:500 000行 - 所有不同的业务键:
SELECT DISTINCT ... - 11秒
SELECT ... GROUP BY business_key - 13秒



所以这个问题只与 HSQLDB 有关。



如果有人能解释为什么会有这么大的差异,我将非常感激。

解决方案

这两个查询表达了相同的问题。显然,查询优化器选择两个不同的执行计划。我的猜测是, distinct 方法执行如下:


  • 全部复制 business_key 值添加到临时表中
  • 排序临时表

  • 扫描临时表,返回每个项目与之前的项目不同



执行如下:


  • 扫描整个表,存储业务密钥在散列表中

  • 返回散列表的键



第一种方法优化内存使用:当临时表的一部分必须被换出时,它仍然会表现得相当好。第二种方法优化速度,但如果有很多不同的键,可能需要大量的内存。



由于您拥有足够的内存或几个不同的键,第二种方法胜过第一种。在两个执行计划之间看到10倍甚至100倍的性能差异并不罕见。

I am performing some tests on a HSQLDB server with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them. Naturally I started with a DISTINCT query:

SELECT DISTINCT business_key FROM memory WHERE
   concept <> 'case' or 
   attrib <> 'status' or 
   value <> 'closed'

It takes around 90 seconds!!!

Then I tried using GROUP BY:

SELECT business_key FROM memory WHERE
       concept <> 'case' or 
       attrib <> 'status' or 
       value <> 'closed'
GROUP BY business_key

And it takes 1 second!!!

Trying to figure out the difference I ran EXLAIN PLAN FOR but it seems to give the same information for both queries.

EXLAIN PLAN FOR DISTINCT ...

isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EXLAIN PLAN FOR SELECT ... GROUP BY ...

isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[false]
columns=[
  COLUMN: PUBLIC.MEMORY.BUSINESS_KEY
]
[range variable 1
  join type=INNER
  table=MEMORY
  alias=M
  access=FULL SCAN
  condition = [    index=SYS_IDX_SYS_PK_10057_10058
    other condition=[
    OR arg_left=[
     OR arg_left=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[
       VALUE = case, TYPE = CHARACTER]] arg_right=[
      NOT_EQUAL arg_left=[
       COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[
       VALUE = status, TYPE = CHARACTER]]] arg_right=[
     NOT_EQUAL arg_left=[
      COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[
      VALUE = closed, TYPE = CHARACTER]]]
  ]
]]
groupColumns=[
COLUMN: PUBLIC.MEMORY.BUSINESS_KEY]
PARAMETERS=[]
SUBQUERIES[]
Object References
PUBLIC.MEMORY
PUBLIC.MEMORY.CONCEPT
PUBLIC.MEMORY.ATTRIB
PUBLIC.MEMORY.VALUE
PUBLIC.MEMORY.BUSINESS_KEY
Read Locks
PUBLIC.MEMORY
WriteLocks

EDIT: I did additional tests. With 500 000 records in HSQLDB with all distinct business keys, the performance of DISTINCT is now better - 3 seconds, vs GROUP BY which took around 9 seconds.

In MySQL both queries preform the same:

MySQL: 500 000 rows - 5 000 distinct business keys: Both queries: 0.5 second MySQL: 500 000 rows - all distinct business keys: SELECT DISTINCT ... - 11 seconds SELECT ... GROUP BY business_key - 13 seconds

So the problem is only related to HSQLDB.

I will be very grateful if someone can explain why there is such a drastic difference.

解决方案

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

  • Copy all business_key values to a temporary table
  • Sort the temporary table
  • Scan the temporary table, returning each item that is different from the one before it

The group by could be executed like:

  • Scan the full table, storing each value of business key in a hashtable
  • Return the keys of the hashtable

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.

这篇关于使用group vs vs distinct的巨大性能差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆