大行集的IN与JOIN [英] IN vs. JOIN with large rowsets
问题描述
我想选择一个表中的行,其中主键位于另一个表中.我不确定在SQL Server 2005中是否应该使用JOIN或IN运算符.在具有大型数据集(即数百万行)的这两个SQL查询之间,是否存在明显的性能差异?
I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any significant performance difference between these two SQL queries with a large dataset (i.e. millions of rows)?
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a JOIN b ON a.c = b.d
推荐答案
更新:
我博客中的这篇文章总结了我的答案和对其他答案的评论,并显示了实际的执行计划:
This article in my blog summarizes both my answer and my comments to another answers, and shows actual execution plans:
SELECT *
FROM a
WHERE a.c IN (SELECT d FROM b)
SELECT a.*
FROM a
JOIN b
ON a.c = b.d
这些查询不相同.如果您的表b
不是键保存的(即b.d
的值不是唯一的),它们会产生不同的结果.
These queries are not equivalent. They can yield different results if your table b
is not key preserved (i. e. the values of b.d
are not unique).
第一个查询的等效项如下:
The equivalent of the first query is the following:
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT d
FROM b
) bo
ON a.c = bo.d
如果b.d
是UNIQUE
并标记为这样(用UNIQUE INDEX
或UNIQUE CONSTRAINT
),则这些查询是相同的,并且很可能将使用相同的计划,因为SQL Server
足够聪明,可以执行此操作考虑在内.
If b.d
is UNIQUE
and marked as such (with a UNIQUE INDEX
or UNIQUE CONSTRAINT
), then these queries are identical and most probably will use identical plans, since SQL Server
is smart enough to take this into account.
SQL Server
可以采用以下方法之一来运行此查询:
SQL Server
can employ one of the following methods to run this query:
-
如果
a.c
上有索引,d
为UNIQUE
,并且b
与a
相比较小,则该条件将传播到子查询和普通INNER JOIN
用于(以b
开头)
If there is an index on
a.c
,d
isUNIQUE
andb
is relatively small compared toa
, then the condition is propagated into the subquery and the plainINNER JOIN
is used (withb
leading)
如果b.d
上有索引,而d
不是UNIQUE
,则该条件也会传播,并使用LEFT SEMI JOIN
.也可以用于上述条件.
If there is an index on b.d
and d
is not UNIQUE
, then the condition is also propagated and LEFT SEMI JOIN
is used. It can also be used for the condition above.
如果b.d
和a.c
上都有索引并且它们很大,则使用MERGE SEMI JOIN
If there is an index on both b.d
and a.c
and they are large, then MERGE SEMI JOIN
is used
如果任何表上都没有索引,则在b
上构建哈希表,并使用HASH SEMI JOIN
.
If there is no index on any table, then a hash table is built on b
and HASH SEMI JOIN
is used.
都不是每次都会重新评估整个子查询.
Neither of these methods reevaluates the whole subquery each time.
有关其工作原理的更多详细信息,请参见我的博客中的条目:
See this entry in my blog for more detail on how this works:
四大公司的所有RDBMS
都有链接.
There are links for all RDBMS
's of the big four.
这篇关于大行集的IN与JOIN的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!