使用group vs vs distinct的巨大性能差异 [英] Huge performance difference when using group by vs distinct

查看：107 发布时间：2018/5/30 13:33:28 sql performance group-by distinct hsqldb

本文介绍了使用group vs vs distinct的巨大性能差异的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在一个包含500 000个条目的表上对 HSQLDB 服务器执行一些测试。该表没有索引。有5000个不同的业务密钥。我需要他们的清单。当然，我从 DISTINCT 查询开始：

  SELECT DISTINCT business_key FROM memory WHERE 
概念<> 'case'或
 attrib<> '状态'或
值<> '关闭'

大约需要90秒!!!

然后我尝试使用 GROUP BY ：

  SELECT business_key FROM内存WHERE 
概念<> 'case'或
 attrib<> '状态'或
值<> 'closed'
 GROUP BY business_key

它需要1秒!!!

试图找出差异，我运行了 EXLAIN PLAN FOR ，但它似乎给出了两个查询的相同信息。 p>

EXLAIN PLAN FOR DISTINCT ...
isAggregated = [false] 列= [ COLUMN：PUBLIC.MEMORY.BUSINESS_KEY ] [范围变量1 连接类型= INNER table = MEMORY alias = M access = FULL SCAN condition = [index = SYS_IDX_SYS_PK_10057_10058 other condition = [ OR arg_left = [ OR arg_left = [ NOT_EQUAL arg_left = [ COLUMN：PUBLIC.MEMORY.CONCEPT] arg_right = [ VALUE = case，TYPE = CHARACTER]] arg_right = [ NOT_EQUAL arg_left = [ COLUMN：PUBLIC.MEMORY.ATTRIB] arg_right = [ VALUE = status，TYPE = CHARACTER]]] arg_right = [ NOT_EQUAL arg_left = [ COLUMN： PUBL IC.MEMORY.VALUE] arg_right = [ VALUE = closed，TYPE = CHARACTER]]] ] ]] PARAMETERS = [] SUBQUERIES [] 对象引用 PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY 读锁 PUBLIC.MEMORY WriteLocks
EXLAIN PLAN FOR SELECT ... GROUP BY ...
isDistinctSelect = [false] isGrouped = [true] isAggregated = [false] 列= [ COLUMN：PUBLIC.MEMORY.BUSINESS_KEY ] [范围变量1 join type = INNER table = MEMORY alias = M access = FULL SCAN condition = [index = SYS_IDX_SYS_PK_10057_10058 other condition = [ OR arg_left = [ OR arg_left = [ NOT_EQUAL arg_left = [ COLUMN：PUBLIC.MEMORY.CONCEPT] arg_right = [ VALUE = case，TYPE = CHARACTER]] arg_right = [ NOT_EQUAL arg_left = [ COLUMN：PUBLIC.MEMORY.ATTRIB] arg_right = [ VALUE = status，TYPE = CHARACTER]]] arg_right = [ NOT_EQUAL arg_left = [ COLUMN：PUBLIC.MEMORY.VALUE] arg_right = [ VALUE =已关闭，TYPE = CHARACTER]]] ] ]] groupColumns = [ COLUMN：PUBLIC.MEMORY.BUSINESS_KEY] PARAMETERS = [] SUBQUERIES [] 对象引用 PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY 读取锁 PUBLIC.MEMORY WriteLocks
编辑：
我做了额外的测试。在 HSQLDB 中具有所有不同业务密钥的500 000条记录， DISTINCT 的性能现在更好 - 3秒，vs GROUP BY 耗时约9秒。

在 MySQL

MySQL：500 000行 - 5000个不同的业务密钥：
两个查询：0.5秒
MySQL：500 000行 - 所有不同的业务键：
SELECT DISTINCT ... - 11秒
SELECT ... GROUP BY business_key - 13秒
所以这个问题只与 HSQLDB 有关。如果有人能解释为什么会有这么大的差异，我将非常感激。解决方案这两个查询表达了相同的问题。显然，查询优化器选择两个不同的执行计划。我的猜测是， distinct 方法执行如下：全部复制 business_key 值添加到临时表中排序临时表扫描临时表，返回每个项目与之前的项目不同执行如下：扫描整个表，存储业务密钥在散列表中返回散列表的键第一种方法优化内存使用：当临时表的一部分必须被换出时，它仍然会表现得相当好。第二种方法优化速度，但如果有很多不同的键，可能需要大量的内存。由于您拥有足够的内存或几个不同的键，第二种方法胜过第一种。在两个执行计划之间看到10倍甚至100倍的性能差异并不罕见。 I am performing some tests on a HSQLDB server with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them. Naturally I started with a DISTINCT query: SELECT DISTINCT business_key FROM memory WHERE concept <> 'case' or attrib <> 'status' or value <> 'closed' It takes around 90 seconds!!! Then I tried using GROUP BY: SELECT business_key FROM memory WHERE concept <> 'case' or attrib <> 'status' or value <> 'closed' GROUP BY business_key And it takes 1 second!!! Trying to figure out the difference I ran EXLAIN PLAN FOR but it seems to give the same information for both queries. EXLAIN PLAN FOR DISTINCT ... isAggregated=[false] columns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1 join type=INNER table=MEMORY alias=M access=FULL SCAN condition = [ index=SYS_IDX_SYS_PK_10057_10058 other condition=[ OR arg_left=[ OR arg_left=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[ VALUE = case, TYPE = CHARACTER]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[ VALUE = status, TYPE = CHARACTER]]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[ VALUE = closed, TYPE = CHARACTER]]] ] ]] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks EXLAIN PLAN FOR SELECT ... GROUP BY ... isDistinctSelect=[false] isGrouped=[true] isAggregated=[false] columns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1 join type=INNER table=MEMORY alias=M access=FULL SCAN condition = [ index=SYS_IDX_SYS_PK_10057_10058 other condition=[ OR arg_left=[ OR arg_left=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[ VALUE = case, TYPE = CHARACTER]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[ VALUE = status, TYPE = CHARACTER]]] arg_right=[ NOT_EQUAL arg_left=[ COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[ VALUE = closed, TYPE = CHARACTER]]] ] ]] groupColumns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks EDIT: I did additional tests. With 500 000 records in HSQLDB with all distinct business keys, the performance of DISTINCT is now better - 3 seconds, vs GROUP BY which took around 9 seconds. In MySQL both queries preform the same: MySQL: 500 000 rows - 5 000 distinct business keys: Both queries: 0.5 second MySQL: 500 000 rows - all distinct business keys: SELECT DISTINCT ... - 11 seconds SELECT ... GROUP BY business_key - 13 seconds So the problem is only related to HSQLDB. I will be very grateful if someone can explain why there is such a drastic difference. 解决方案 The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like: Copy all business_key values to a temporary table Sort the temporary table Scan the temporary table, returning each item that is different from the one before it The group by could be executed like: Scan the full table, storing each value of business key in a hashtable Return the keys of the hashtable The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys. Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans. 这篇关于使用group vs vs distinct的巨大性能差异的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

使用group vs vs distinct的巨大性能差异 [英] Huge performance difference when using group by vs distinct

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

使用group vs vs distinct的巨大性能差异 [英] Huge performance difference when using group by vs distinct

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭