JOIN vs. WHERE: Why do two queries that obtain identical results exhibit 3-4 orders of magnitude performance difference?


Problem description


Earlier tonight, I asked this question on StackOverflow regarding how to write a SQL query to filter rows from a table by returning only rows with duplicates in one field.

Here is the question, repeated for convenience:

If I have this data:

code1 code2
  1    10       <-- Desired (1 appears more than once)
  1    11       <-- Desired (1 appears more than once)
  2    20
  3    30       <-- Desired (3 appears more than once)
  3    31       <-- Desired (3 appears more than once)
  4    40
  5    50

... And I want to write a single SQL query whose results are this:

code1 code2
  1    10       <-- This result appears because 1 appears more than once above
  1    11       <-- This result appears because 1 appears more than once above
  3    30       <-- This result appears because 3 appears more than once above
  3    31       <-- This result appears because 3 appears more than once above

(i.e., a single SQL query that returns all rows for which any data in the code1 column appears more than once)...

How do I do it?

I received an answer with two possible SQL queries, both of which work perfectly.

Successful SQL #1:

SELECT code1, code2
FROM myTable
WHERE code1 IN 
    (SELECT code1 FROM myTable GROUP BY code1 HAVING COUNT(code1) > 1)

Successful SQL #2:

SELECT t.code1, code2
FROM myTable t
  INNER JOIN
    (SELECT code1 FROM myTable GROUP BY code1 HAVING COUNT(code1) > 1)
     s on s.code1 = t.code1
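
As a quick check that the two queries really do return the same rows, here is a minimal sketch that runs both against the sample data from the question. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL (an assumption made for portability), so it only demonstrates that the results match and says nothing about MySQL's performance.

```python
import sqlite3

# Sample rows copied from the question's example table.
ROWS = [(1, 10), (1, 11), (2, 20), (3, 30), (3, 31), (4, 40), (5, 50)]

SQL_1 = """
SELECT code1, code2
FROM myTable
WHERE code1 IN
    (SELECT code1 FROM myTable GROUP BY code1 HAVING COUNT(code1) > 1)
"""

SQL_2 = """
SELECT t.code1, code2
FROM myTable t
  INNER JOIN
    (SELECT code1 FROM myTable GROUP BY code1 HAVING COUNT(code1) > 1)
     s ON s.code1 = t.code1
"""


def run(sql):
    # In-memory SQLite stands in for MySQL here (assumption, not the asker's setup).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE myTable (code1 INTEGER, code2 INTEGER)")
    conn.executemany("INSERT INTO myTable VALUES (?, ?)", ROWS)
    # Sort so the comparison does not depend on the engine's row order.
    result = sorted(conn.execute(sql).fetchall())
    conn.close()
    return result


result_in = run(SQL_1)
result_join = run(SQL_2)
print(result_in == result_join, result_in)
```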

As I describe in a comment beneath the answer:

myTable has ~30000 rows, with only about 400 duplicate groups, and almost always just 2 entries per duplicate group. On my MySQL instance running on a high-end workstation, SQL #1 takes ~30 minutes to execute, whereas SQL #2 requires a fraction of a second.

This is a three to four orders of magnitude difference in performance between the two queries above.

It troubles me that it is not immediately obvious to me, looking at the queries, why one should perform three orders of magnitude better than the other in my use-case.

I would like to have a better understanding of the internals of SQL execution, and this particular example is excellent to assist with that.

My question is: Why does SQL #2 run about 5,000 times faster than SQL #1 in my use case?

Solution

MySQL has known issues with optimising queries that involve correlated subqueries, or subselects. Up until version 5.6.5 it did not materialise subqueries, although it would materialise a derived table used in a join.

In essence this means that when you use a join, the first time the subquery is encountered MySQL will perform the following:

SELECT code1 FROM myTable GROUP BY code1 HAVING COUNT(code1) > 1

It then keeps the results in a temporary table (which is hashed to make lookups faster) and, for each value in myTable, looks the code up in that temporary table to see whether it is there.
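
The materialise-then-probe strategy can be sketched in plain Python. This is only an illustration of the idea, not MySQL's actual implementation: the function name `materialise_then_probe` is hypothetical, and a Python set stands in for the hashed temporary table.

```python
from collections import Counter


def materialise_then_probe(rows):
    """Return rows whose code1 is duplicated, running the grouping pass once."""
    # Step 1: build the derived table a single time
    # (the equivalent of GROUP BY code1 HAVING COUNT(code1) > 1).
    counts = Counter(code1 for code1, _ in rows)
    duplicated = {code1 for code1, n in counts.items() if n > 1}  # stand-in for the hashed temp table

    # Step 2: one hash lookup per outer row.
    return [(code1, code2) for code1, code2 in rows if code1 in duplicated]


# Sample rows from the question.
rows = [(1, 10), (1, 11), (2, 20), (3, 30), (3, 31), (4, 40), (5, 50)]
matched = materialise_then_probe(rows)
print(matched)
```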

However, when you use IN the subquery is not materialised; instead it is rewritten like this:

SELECT t1.code1, t1.code2
FROM myTable t1
WHERE EXISTS
    (   SELECT t2.code1 
        FROM myTable t2
        WHERE t2.code1 = t1.code1
        GROUP BY t2.code1 
        HAVING COUNT(t2.code1) > 1
    )

This means that the subquery runs again for each code in myTable. When the outer query is very narrow this is fine, because running the subquery only a few times is more efficient than running it for all values and storing the results in a temporary table. When the outer query is wide, however, the inner query executes many times, and this is where the performance difference kicks in.
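
To make the repeated execution concrete, here is a rough Python sketch of the dependent-subquery behaviour. It is a simplified model rather than MySQL's real executor: the function name `dependent_subquery` is hypothetical, and it assumes there is no index on code1, so every inner execution is a full scan.

```python
def dependent_subquery(rows):
    """Mimic the EXISTS rewrite: re-run the inner query for every outer row."""
    executions = 0
    matched = []
    for code1, code2 in rows:  # outer query: one iteration per row
        executions += 1
        # Inner query: scan the table again, filtered on this row's code1.
        # Assumption: no index on code1, so this is a full scan each time.
        inner_count = sum(1 for c, _ in rows if c == code1)
        if inner_count > 1:  # HAVING COUNT(...) > 1
            matched.append((code1, code2))
    return matched, executions


# Sample rows from the question.
rows = [(1, 10), (1, 11), (2, 20), (3, 30), (3, 31), (4, 40), (5, 50)]
matched, executions = dependent_subquery(rows)
print(matched, executions)
```

The inner scan runs once per outer row, so the number of executions equals the number of rows in the table.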

So for your row counts, with the join you run the subquery once instead of ~30,000 times, then look up ~30,000 rows against a hashed temporary table containing only about 400 rows. This would account for such a drastic performance difference.
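
A back-of-the-envelope estimate shows why the gap is so large. The sketch below is a crude cost model under stated assumptions, not a measurement: it assumes no index on code1 (so each dependent execution scans the whole table), counts only row visits, and ignores constant factors, caching and I/O. The function name `estimated_row_visits` is hypothetical.

```python
def estimated_row_visits(n_rows):
    """Rough row-visit counts, assuming a full scan per dependent execution."""
    dependent = n_rows * n_rows      # one full scan for every outer row
    materialised = n_rows + n_rows   # one grouping scan plus one probe per row
    return dependent, materialised


dependent, materialised = estimated_row_visits(30_000)
ratio = dependent / materialised
print(dependent, materialised, ratio)
```

Under these assumptions the dependent form visits on the order of 900 million rows against about 60 thousand for the materialised form, which is in the same ballpark as the three to four orders of magnitude reported in the question.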

This article in the online docs explains subquery optimisation in much more depth.
