Get records with highest/smallest <whatever> per group

Question

How do I do that?

The former title of this question was "using rank (@Rank := @Rank + 1) in complex query with subqueries - will it work?", because I was looking for a solution using ranks, but now I see that the solution posted by Bill is much, much better.

Original question:

I'm trying to compose a query that would take the last record from each group, given some defined order:

SET @Rank=0;

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from Table
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from Table
      order by OrderField
      ) as s 
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField

The expression @Rank := @Rank + 1 is normally used for ranking, but to me it looks suspicious when it is used in two subqueries yet initialized only once. Will it work this way?

And second, will it work with a single subquery that is evaluated multiple times, such as a subquery in a WHERE (or HAVING) clause? Here is another way to write the above:

SET @Rank=0;

select Table.*, @Rank := @Rank + 1 AS Rank
from Table
having Rank = (select max(Rank) AS MaxRank
              from (select GroupId, @Rank := @Rank + 1 AS Rank 
                    from Table as t0
                    order by OrderField
                    ) as t
              where t.GroupId = table.GroupId
             )
order by OrderField

Thanks in advance!

Recommended answer

So you want to get the row with the highest OrderField per group? I'd do it this way:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
WHERE t2.GroupId IS NULL
ORDER BY t1.OrderField; -- not needed! (note by Tomas)

(EDIT by Tomas: If several records in the same group share the same OrderField and you need exactly one of them, you may want to extend the condition:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId 
        AND (t1.OrderField < t2.OrderField 
         OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
WHERE t2.GroupId IS NULL

End of edit.)
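
The effect of the extended condition can be checked on a small table that contains a tie. The following is a minimal sketch and not part of the original answer: it runs the two queries above through Python's built-in sqlite3 module rather than MySQL, the sample rows and Id values are made up for illustration, and an ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE `Table` (Id INTEGER PRIMARY KEY, GroupId INT, OrderField INT);
    -- Hypothetical rows: group 10 has two rows tied on the highest OrderField.
    INSERT INTO `Table` (Id, GroupId, OrderField) VALUES
        (1, 10, 10), (2, 10, 30), (3, 10, 30), (4, 20, 60);
""")

# Basic anti-join: a row survives if no row in its group has a greater OrderField.
basic = conn.execute("""
    SELECT t1.Id, t1.GroupId, t1.OrderField
    FROM `Table` AS t1
    LEFT OUTER JOIN `Table` AS t2
      ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
    WHERE t2.GroupId IS NULL
    ORDER BY t1.Id
""").fetchall()

# Extended condition: ties on OrderField are broken by the greater Id.
tie_broken = conn.execute("""
    SELECT t1.Id, t1.GroupId, t1.OrderField
    FROM `Table` AS t1
    LEFT OUTER JOIN `Table` AS t2
      ON t1.GroupId = t2.GroupId
            AND (t1.OrderField < t2.OrderField
             OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
    WHERE t2.GroupId IS NULL
    ORDER BY t1.Id
""").fetchall()

print(basic)       # both tied rows of group 10 are returned
print(tie_broken)  # exactly one row per group
```

With these rows the basic query returns Ids 2, 3 and 4, while the extended query returns only Ids 3 and 4, that is, one row per group.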

In other words, return the row t1 for which no other row t2 exists with the same GroupId and a greater OrderField. When t2.* is NULL, it means the left outer join found no such match, and therefore t1 has the greatest value of OrderField in the group.

No ranks, no subqueries. This should run fast and optimize access to t2 with "Using index" if you have a compound index on (GroupId, OrderField).
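
As a rough way to try this out, the sketch below builds the six-row sample table used later in this answer, adds a compound index on (GroupId, OrderField), and runs the join. It is an illustration under assumptions, not the author's test setup: it uses Python's built-in sqlite3 module instead of MySQL, the index name idx_group_order is invented, and SQLite's EXPLAIN QUERY PLAN is not the same as MySQL's EXPLAIN (its wording also varies between SQLite versions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE `Table` (GroupId INT, OrderField INT, foo TEXT);
    INSERT INTO `Table` VALUES
        (10, 10, NULL), (10, 20, NULL), (10, 30, 'foo'),
        (20, 40, NULL), (20, 50, NULL), (20, 60, 'foo');
    -- Compound index; the name idx_group_order is a made-up example.
    CREATE INDEX idx_group_order ON `Table` (GroupId, OrderField);
""")

query = """
    SELECT t1.*
    FROM `Table` AS t1
    LEFT OUTER JOIN `Table` AS t2
      ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
    WHERE t2.GroupId IS NULL
    ORDER BY t1.GroupId
"""

rows = conn.execute(query).fetchall()
print(rows)  # [(10, 30, 'foo'), (20, 60, 'foo')]

# Show how SQLite plans the join; the exact text depends on the SQLite version.
for plan_row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(plan_row)
```

Only the row with the greatest OrderField in each group comes back. Whether the index is actually chosen on a real MySQL table should still be confirmed with EXPLAIN there.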

Regarding performance, see my answer to Retrieving the last record in each group. I tried a subquery method and the join method using the Stack Overflow data dump. The difference is remarkable: the join method ran 278 times faster in my test.

It's important that you have the right index to get the best results!

Regarding your method using the @Rank variable, it won't work as you've written it, because the values of @Rank won't reset to zero after the query has processed the first table. I'll show you an example.

I inserted some dummy data, with an extra field that is null except on the row we know is the greatest per group:

select * from `Table`;

+---------+------------+------+
| GroupId | OrderField | foo  |
+---------+------------+------+
|      10 |         10 | NULL |
|      10 |         20 | NULL |
|      10 |         30 | foo  |
|      20 |         40 | NULL |
|      20 |         50 | NULL |
|      20 |         60 | foo  |
+---------+------------+------+

We can show that the rank increases to three for the first group and six for the second group, and the inner query returns these correctly:

select GroupId, max(Rank) AS MaxRank
from (
  select GroupId, @Rank := @Rank + 1 AS Rank
  from `Table`
  order by OrderField) as t
group by GroupId

+---------+---------+
| GroupId | MaxRank |
+---------+---------+
|      10 |       3 |
|      20 |       6 |
+---------+---------+

Now run the query with no join condition, to force a Cartesian product of all rows, and we also fetch all columns:

select s.*, t.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s 
  -- on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+---------+---------+------------+------+------+
| GroupId | MaxRank | GroupId | OrderField | foo  | Rank |
+---------+---------+---------+------------+------+------+
|      10 |       3 |      10 |         10 | NULL |    7 |
|      20 |       6 |      10 |         10 | NULL |    7 |
|      10 |       3 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         30 | foo  |    9 |
|      10 |       3 |      10 |         30 | foo  |    9 |
|      10 |       3 |      20 |         40 | NULL |   10 |
|      20 |       6 |      20 |         40 | NULL |   10 |
|      10 |       3 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         60 | foo  |   12 |
|      10 |       3 |      20 |         60 | foo  |   12 |
+---------+---------+---------+------------+------+------+

We can see from the above that the max rank per group is correct, but @Rank then continues to increase as the second derived table is processed, to 7 and higher. So the ranks from the second derived table will never overlap with the ranks from the first derived table at all.

You'd have to add another derived table to force @Rank to reset to zero in between processing the two tables (and hope the optimizer doesn't change the order in which it evaluates tables, or else use STRAIGHT_JOIN to prevent that):

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank 
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t 
  join (select @Rank := 0) r -- RESET @Rank TO ZERO HERE
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s 
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+------------+------+------+
| GroupId | OrderField | foo  | Rank |
+---------+------------+------+------+
|      10 |         30 | foo  |    3 |
|      20 |         60 | foo  |    6 |
+---------+------------+------+------+
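
The numbers above can be reproduced outside MySQL with a plain shared counter standing in for @Rank. This is only an illustrative Python sketch of the evaluation order described here (first derived table, optional reset, second derived table). The function name ranked_join and its reset flag are hypothetical, and the sketch does not model how MySQL's optimizer actually schedules the derived tables.

```python
# (GroupId, OrderField) pairs from the sample table.
ROWS = [(10, 10), (10, 20), (10, 30), (20, 40), (20, 50), (20, 60)]


def ranked_join(rows, reset):
    """Mimic the two ranked derived tables sharing one counter."""
    rank = 0  # plays the role of @Rank, initialized only once
    ordered = sorted(rows, key=lambda r: r[1])  # order by OrderField

    # First derived table: highest rank seen per group.
    max_rank = {}
    for group_id, _ in ordered:
        rank += 1
        max_rank[group_id] = max(max_rank.get(group_id, 0), rank)

    if reset:
        rank = 0  # stands in for the extra (select @Rank := 0) derived table

    # Second derived table: rank every row again, then join on (GroupId, Rank).
    result = []
    for group_id, order_field in ordered:
        rank += 1
        if max_rank[group_id] == rank:
            result.append((group_id, order_field, rank))
    return result


print(ranked_join(ROWS, reset=False))  # [] because the second pass ranks 7..12
print(ranked_join(ROWS, reset=True))   # [(10, 30, 3), (20, 60, 6)]
```

Without the reset the second pass never produces a rank of 3 or 6, so the join finds nothing; with the reset the two passes line up and the last row of each group is matched.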

But the optimization of this query is terrible. It can't use any indexes, it creates two temporary tables, sorts them the hard way, and even uses a join buffer because it can't use an index when joining temp tables either. This is example output from EXPLAIN:

+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived4> | system | NULL          | NULL | NULL    | NULL |    1 | Using temporary; Using filesort |
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL | NULL    | NULL |    2 |                                 |
|  1 | PRIMARY     | <derived5> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using where; Using join buffer  |
|  5 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
|  4 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used                  |
|  2 | DERIVED     | <derived3> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using temporary; Using filesort |
|  3 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+

Whereas my solution using the left outer join optimizes much better. It uses no temp table and even reports "Using index" which means it can resolve the join using only the index, without touching the data.

+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref             | rows | Extra                    |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL    | NULL    | NULL            |    6 | Using filesort           |
|  1 | SIMPLE      | t2    | ref  | GroupId       | GroupId | 5       | test.t1.GroupId |    1 | Using where; Using index |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+

You'll probably read people making claims on their blogs that "joins make SQL slow," but that's nonsense. Poor optimization makes SQL slow.
