SQL UNION ALL to eliminate duplicates

Problem description

I found this sample interview question and answer posted on toptal, reproduced here. But I don't really understand the code. How can a UNION ALL turn into a UNION (distinct) like that? Also, why is this code faster?

Question

Write a SQL query using UNION ALL (not UNION) that uses the WHERE clause to eliminate duplicates. Why might you want to do this?

Answer

You can avoid duplicates using UNION ALL, and still run much faster than UNION DISTINCT (which is actually the same as UNION), by running a query like this:

SELECT * FROM mytable WHERE a=X
UNION ALL
SELECT * FROM mytable WHERE b=Y AND a!=X

The key is the AND a!=X part. This gives you the benefits of the UNION (a.k.a. UNION DISTINCT) command, while avoiding much of its performance hit.

Recommended answer

But in the example, the first query has a condition on column a, whereas the second query has a condition on column b. This probably came from a query that's hard to optimize:

SELECT * FROM mytable WHERE a=X OR b=Y

This query is hard to optimize with simple B-tree indexing. Does the engine search an index on column a? Or on column b? Either way, searching for the other term requires a table scan.

Hence the trick of using UNION to split it into two queries, one for each term. Each subquery can use the best index for its search term. Then combine the results using UNION.

But the two subsets may overlap, because some rows where b=Y may also have a=X, in which case such rows occur in both subsets. Therefore you have to do duplicate elimination, or else you see some rows twice in the final result.

SELECT * FROM mytable WHERE a=X 
UNION DISTINCT
SELECT * FROM mytable WHERE b=Y

UNION DISTINCT is expensive because typical implementations sort the rows to find duplicates, just as they do when you use SELECT DISTINCT ....
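To make the overlap concrete, here is a small sketch using Python's built-in sqlite3 module. The table name and columns follow the question; the row values and the choice of X=1 and Y=2 are made up for illustration. SQLite spells UNION DISTINCT as plain UNION, so that is what the sketch uses.

```python
import sqlite3

# Hypothetical sample data: X=1 and Y=2 are illustrative values only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO mytable (id, a, b) VALUES (?, ?, ?)",
    [
        (1, 1, 2),  # a=X and b=Y: matched by both subqueries
        (2, 1, 9),  # a=X only
        (3, 7, 2),  # b=Y only
        (4, 7, 9),  # matched by neither
    ],
)

# Plain UNION ALL keeps the overlapping row twice.
union_all_ids = sorted(
    row[0]
    for row in conn.execute(
        "SELECT * FROM mytable WHERE a = 1 "
        "UNION ALL "
        "SELECT * FROM mytable WHERE b = 2"
    )
)

# UNION performs the duplicate elimination step.
union_ids = sorted(
    row[0]
    for row in conn.execute(
        "SELECT * FROM mytable WHERE a = 1 "
        "UNION "
        "SELECT * FROM mytable WHERE b = 2"
    )
)

print(union_all_ids)  # [1, 1, 2, 3]
print(union_ids)      # [1, 2, 3]
```

Row 1 satisfies both conditions, so it shows up twice unless the engine does the extra work of removing it.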

We also have a perception that it's even more "wasted" work if the two subsets of rows you are unioning have a lot of rows occurring in both subsets. It's a lot of rows to eliminate.

But there's no need to eliminate duplicates if you can guarantee that the two sets of rows are already distinct. That is, if you guarantee there is no overlap. If you can rely on that, then eliminating duplicates would always be a no-op, so the query can skip that step, and with it the costly sorting.

If you change the queries so that they are guaranteed to select non-overlapping subsets of rows, that's a win.

SELECT * FROM mytable WHERE a=X 
UNION ALL 
SELECT * FROM mytable WHERE b=Y AND a!=X

These two sets are guaranteed to have no overlap. If the first set has rows where a=X and the second set has rows where a!=X, then there can be no row that is in both sets.

The second query therefore only catches some of the rows where b=Y, but any row where a=X AND b=Y is already included in the first set.
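As a quick check of that reasoning, the sketch below compares the original OR query with the filtered UNION ALL form, using Python's built-in sqlite3 module. The sample rows and the values X=1 and Y=2 are hypothetical, and both columns are declared NOT NULL, which is an assumption of this sketch rather than something the question states.

```python
import sqlite3

# Hypothetical sample data: X=1 and Y=2 are illustrative values only.
conn = sqlite3.connect(":memory:")
# NOT NULL is an assumption: a != X does not match rows where a is NULL.
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO mytable (id, a, b) VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 1, 9), (3, 7, 2), (4, 7, 9)],
)

# The query we actually want to answer.
or_ids = sorted(
    row[0] for row in conn.execute("SELECT * FROM mytable WHERE a = 1 OR b = 2")
)

# The rewrite: the second branch excludes rows the first branch already returned.
filtered_ids = sorted(
    row[0]
    for row in conn.execute(
        "SELECT * FROM mytable WHERE a = 1 "
        "UNION ALL "
        "SELECT * FROM mytable WHERE b = 2 AND a != 1"
    )
)

print(or_ids)        # [1, 2, 3]
print(filtered_ids)  # [1, 2, 3]
```

Row 1 (a=X and b=Y) comes only from the first branch, row 3 only from the second, so the combined result has the same rows as the OR query with no duplicate to remove.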

So the query achieves an optimized search for two OR terms, without producing duplicates, and without requiring a UNION DISTINCT operation.
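One caveat worth checking before relying on this rewrite: under SQL's three-valued logic, a!=X is not true when a is NULL, so a row with a NULL in column a and b=Y would be returned by the OR query but dropped by the filtered branch. The sketch below, again using Python's sqlite3 with hypothetical data (X=1, Y=2, a nullable column a), shows the difference and one possible null-safe variant of the filter.

```python
import sqlite3

# Hypothetical sample data with a nullable column a; X=1 and Y=2 are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO mytable (id, a, b) VALUES (?, ?, ?)",
    [
        (1, 1, 2),     # a=X and b=Y
        (2, None, 2),  # a IS NULL and b=Y
    ],
)


def ids(sql):
    """Run a query and return the sorted id column of its rows."""
    return sorted(row[0] for row in conn.execute(sql))


or_ids = ids("SELECT * FROM mytable WHERE a = 1 OR b = 2")

# a != 1 evaluates to NULL for row 2, so the second branch filters it out.
naive_ids = ids(
    "SELECT * FROM mytable WHERE a = 1 "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b = 2 AND a != 1"
)

# Treat NULL explicitly as "not equal to X" in the second branch.
null_safe_ids = ids(
    "SELECT * FROM mytable WHERE a = 1 "
    "UNION ALL "
    "SELECT * FROM mytable WHERE b = 2 AND (a != 1 OR a IS NULL)"
)

print(or_ids)         # [1, 2]
print(naive_ids)      # [1]
print(null_safe_ids)  # [1, 2]
```

If column a is declared NOT NULL, the plain a!=X filter is enough and the extra condition is unnecessary.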
