获取最高/最小的记录<whatever>每组 [英] Get records with highest/smallest <whatever> per group

查看:36
本文介绍了获取最高/最小的记录<whatever>每组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

怎么做?

这个问题的前标题是在带有子查询的复杂查询中使用等级(@Rank := @Rank + 1) - 会起作用吗?"因为我正在寻找使用等级的解决方案,但是现在我看到比尔发布的解决方案要好得多.

Former title of this question was "using rank (@Rank := @Rank + 1) in complex query with subqueries - will it work?" because I was looking for solution using ranks, but now I see that the solution posted by Bill is much much better.

原始问题:

我正在尝试编写一个查询,该查询将从每个组中获取最后一条记录给定一些定义的顺序:

I'm trying to compose a query that would take last record from each group given some defined order:

SET @Rank=0;

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from Table
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from Table
      order by OrderField
      ) as s 
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField

表达式 @Rank := @Rank + 1 通常用于排名,但对我来说,在 2 个子查询中使用时看起来很可疑,但只初始化一次.会这样吗?

Expression @Rank := @Rank + 1 is normally used for rank, but for me it looks suspicious when used in 2 subqueries, but initialized only once. Will it work this way?

其次,它是否适用于一个被多次评估的子查询?像 where (or have) 子句中的子查询(另一种写法):

And second, will it work with one subquery that is evaluated multiple times? Like subquery in where (or having) clause (another way how to write the above):

SET @Rank=0;

select Table.*, @Rank := @Rank + 1 AS Rank
from Table
having Rank = (select max(Rank) AS MaxRank
              from (select GroupId, @Rank := @Rank + 1 AS Rank 
                    from Table as t0
                    order by OrderField
                    ) as t
              where t.GroupId = table.GroupId
             )
order by OrderField

提前致谢!

推荐答案

所以您想获得每组 OrderField 最高的行?我会这样做:

So you want to get the row with the highest OrderField per group? I'd do it this way:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
WHERE t2.GroupId IS NULL
ORDER BY t1.OrderField; // not needed! (note by Tomas)

(由 Tomas 如果同一组中有更多具有相同 OrderField 的记录,而您正好需要其中之一,您可能需要扩展条件:

(EDIT by Tomas: If there are more records with the same OrderField within the same group and you need exactly one of them, you may want to extend the condition:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId 
        AND (t1.OrderField < t2.OrderField 
         OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
WHERE t2.GroupId IS NULL

编辑结束.)

换句话说,返回没有其他行 t2 的行 t1 具有相同的 GroupId 和更大的 OrderField.当 t2.* 为 NULL 时,表示左外连接没有找到这样的匹配,因此 t1 具有 OrderField 的最大值团体.

In other words, return the row t1 for which no other row t2 exists with the same GroupId and a greater OrderField. When t2.* is NULL, it means the left outer join found no such match, and therefore t1 has the greatest value of OrderField in the group.

没有排名,没有子查询.如果您在 (GroupId, OrderField) 上有复合索引,这应该运行得很快,并通过使用索引"优化对 t2 的访问.

No ranks, no subqueries. This should run fast and optimize access to t2 with "Using index" if you have a compound index on (GroupId, OrderField).

关于性能,请参阅我对检索每组中的最后一条记录的回答.我尝试了使用 Stack Overflow 数据转储的子查询方法和连接方法.区别很明显:在我的测试中,join 方法的运行速度快了 278 倍.

Regarding performance, see my answer to Retrieving the last record in each group. I tried a subquery method and the join method using the Stack Overflow data dump. The difference is remarkable: the join method ran 278 times faster in my test.

拥有正确的索引以获得最佳结果很重要!

It's important that you have the right index to get the best results!

关于您使用@Rank 变量的方法,它不会像您编写的那样工作,因为在查询处理第一个表后@Rank 的值不会重置为零.我给你举个例子.

Regarding your method using the @Rank variable, it won't work as you've written it, because the values of @Rank won't reset to zero after the query has processed the first table. I'll show you an example.

我插入了一些虚拟数据,除了我们知道每组最大的行外,还有一个额外的字段为空:

I inserted some dummy data, with an extra field that is null except on the row we know is the greatest per group:

select * from `Table`;

+---------+------------+------+
| GroupId | OrderField | foo  |
+---------+------------+------+
|      10 |         10 | NULL |
|      10 |         20 | NULL |
|      10 |         30 | foo  |
|      20 |         40 | NULL |
|      20 |         50 | NULL |
|      20 |         60 | foo  |
+---------+------------+------+

我们可以证明第一组的排名增加到 3,第二组的排名增加到 6,并且内部查询正确返回这些:

We can show that the rank increases to three for the first group and six for the second group, and the inner query returns these correctly:

select GroupId, max(Rank) AS MaxRank
from (
  select GroupId, @Rank := @Rank + 1 AS Rank
  from `Table`
  order by OrderField) as t
group by GroupId

+---------+---------+
| GroupId | MaxRank |
+---------+---------+
|      10 |       3 |
|      20 |       6 |
+---------+---------+

现在运行没有连接条件的查询,强制所有行的笛卡尔积,我们还获取所有列:

Now run the query with no join condition, to force a Cartesian product of all rows, and we also fetch all columns:

select s.*, t.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s 
  -- on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+---------+---------+------------+------+------+
| GroupId | MaxRank | GroupId | OrderField | foo  | Rank |
+---------+---------+---------+------------+------+------+
|      10 |       3 |      10 |         10 | NULL |    7 |
|      20 |       6 |      10 |         10 | NULL |    7 |
|      10 |       3 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         30 | foo  |    9 |
|      10 |       3 |      10 |         30 | foo  |    9 |
|      10 |       3 |      20 |         40 | NULL |   10 |
|      20 |       6 |      20 |         40 | NULL |   10 |
|      10 |       3 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         60 | foo  |   12 |
|      10 |       3 |      20 |         60 | foo  |   12 |
+---------+---------+---------+------------+------+------+

从上面我们可以看到,每个组的最大排名是正确的,但是@Rank 在处理第二个派生表时继续增加,达到 7 及更高.因此,第二个派生表的排名永远不会与第一个派生表的排名重叠.

We can see from the above that the max rank per group is correct, but then the @Rank continues to increase as it processes the second derived table, to 7 and on higher. So the ranks from the second derived table will never overlap with the ranks from the first derived table at all.

您必须添加另一个派生表以强制 @Rank 在处理两个表之间重置为零(并希望优化器不会更改它评估表的顺序,否则使用 STRAIGHT_JOIN 来防止这种情况发生):

You'd have to add another derived table to force @Rank to reset to zero in between processing the two tables (and hope the optimizer doesn't change the order in which it evaluates tables, or else use STRAIGHT_JOIN to prevent that):

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (select @Rank := 0) r -- RESET @Rank TO ZERO HERE
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s 
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+------------+------+------+
| GroupId | OrderField | foo  | Rank |
+---------+------------+------+------+
|      10 |         30 | foo  |    3 |
|      20 |         60 | foo  |    6 |
+---------+------------+------+------+

但是这个查询的优化很糟糕.它不能使用任何索引,它创建了两个临时表,对它们进行了艰难的排序,甚至使用了连接缓冲区,因为它在连接临时表时也不能使用索引.这是 EXPLAIN 的示例输出:

But the optimization of this query is terrible. It can't use any indexes, it creates two temporary tables, sorts them the hard way, and even uses a join buffer because it can't use an index when joining temp tables either. This is example output from EXPLAIN:

+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived4> | system | NULL          | NULL | NULL    | NULL |    1 | Using temporary; Using filesort |
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL | NULL    | NULL |    2 |                                 |
|  1 | PRIMARY     | <derived5> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using where; Using join buffer  |
|  5 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
|  4 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used                  |
|  2 | DERIVED     | <derived3> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using temporary; Using filesort |
|  3 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+

而我使用左外连接的解决方案优化得更好.它不使用临时表,甚至报告Using index",这意味着它可以仅使用索引解析连接,而无需触及数据.

Whereas my solution using the left outer join optimizes much better. It uses no temp table and even reports "Using index" which means it can resolve the join using only the index, without touching the data.

+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref             | rows | Extra                    |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL    | NULL    | NULL            |    6 | Using filesort           |
|  1 | SIMPLE      | t2    | ref  | GroupId       | GroupId | 5       | test.t1.GroupId |    1 | Using where; Using index |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+

您可能会读到有人在他们的博客上声称联接使 SQL 变慢",但这是无稽之谈.优化不当导致 SQL 变慢.

You'll probably read people making claims on their blogs that "joins make SQL slow," but that's nonsense. Poor optimization makes SQL slow.

这篇关于获取最高/最小的记录&lt;whatever&gt;每组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆