How to properly GROUP BY in MySQL?


Problem description

I have the following sample CARS table (intentionally denormalized for demonstration purposes):

| CAR_ID | OWNER_ID | OWNER_NAME | COLOR |
|--------|----------|------------|-------|
|      1 |        1 |       John | White |
|      2 |        1 |       John | Black |
|      3 |        2 |       Mike | White |
|      4 |        2 |       Mike | Black |
|      5 |        2 |       Mike | Brown |
|      6 |        3 |       Tony | White |

If I wanted to count the number of cars per owner and return this:

| OWNER_ID | OWNER_NAME | TOTAL |
|----------|------------|-------|
|        1 |       John |     2 |
|        2 |       Mike |     3 |
|        3 |       Tony |     1 |

I know I can write the following query:

SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id, owner_name

However, removing owner_name from the GROUP BY clause gives me the same results.

  1. What is the difference between those two queries?
  2. Under what circumstances should I group by all non-aggregated fields in the SELECT statement, and under which ones shouldn't I?
  3. Can you give an example in which this grouping would return different results when removing a non-aggregated field, and explain why?

Solution

The first thing to make clear is that SQL is not MySQL.

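In standard SQL, as defined by SQL-92, it is not allowed to group by a subset of the non-aggregated fields (later revisions of the standard relax this only for columns that are functionally dependent on the grouped ones). The reason is very simple. Suppose I'm running this query: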

SELECT color, owner_name, COUNT(*) FROM cars
GROUP BY color

That query does not make any sense, and even trying to explain it is impossible. It is certainly selecting colors and counting the number of cars per color. However, it also selects the owner_name field, and there can be many owners for a given color, as is the case for White. So if there can be many owner_name values for a single color, which happens to be the only field in the GROUP BY clause, which owner_name will be returned?

If an owner_name has to be returned, then some criterion should be added to select only one of them, e.g., the first one alphabetically, which in this case would be John for White. That criterion translates into the aggregate function MIN(owner_name), and then the query makes sense again, because it groups by at least all the non-aggregated fields in the select statement.
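To make the MIN(owner_name) fix concrete, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL (the original answer targets MySQL; SQLite is used here only because it runs without a server), with the sample CARS table from the question.

```python
import sqlite3

# Sample CARS table from the question, loaded into an in-memory SQLite database.
# Assumption: SQLite is only a stand-in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (car_id INTEGER PRIMARY KEY, owner_id INTEGER, "
    "owner_name TEXT, color TEXT)"
)
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?, ?)",
    [
        (1, 1, "John", "White"),
        (2, 1, "John", "Black"),
        (3, 2, "Mike", "White"),
        (4, 2, "Mike", "Black"),
        (5, 2, "Mike", "Brown"),
        (6, 3, "Tony", "White"),
    ],
)

# MIN(owner_name) states explicitly which owner to report for each color,
# so every selected field is either grouped or aggregated.
per_color = conn.execute(
    "SELECT color, MIN(owner_name) AS first_owner, COUNT(*) AS total "
    "FROM cars GROUP BY color ORDER BY color"
).fetchall()

for color, first_owner, total in per_color:
    print(color, first_owner, total)
# Black John 2
# Brown Mike 1
# White John 3
```

Because the aggregate names the selection rule, the result no longer depends on which row the engine happens to read first.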

As you can see, there is a clear and practical reason for standard SQL to be inflexible about grouping. If it weren't, you could face awkward situations in which the value of a column would be unpredictable, and that is not a nice word, particularly if the query being run is showing you your bank account transactions.

Having said that, why would MySQL allow queries that might not make sense? Even worse, the error in the query above could be detected purely syntactically! The short answer is: performance. The long answer is that there are certain situations in which, based on the relations in the data, picking an unpredictable row from the group still yields a predictable value.

If you haven't figured it out yet, the only way you can predict the value you'll get from taking an unpredictable element of a group is if all the elements in the group are the same. A clear example of this situation is the sample query in your own question. Look at how owner_id and owner_name relate in the table. It is clear that given any owner_id, e.g. 2, you can only have one distinct owner_name. Even with many rows, whichever one is chosen, you will get Mike as the result. In formal database jargon this is expressed as: owner_id functionally determines owner_name.
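Whether owner_id really determines owner_name is a property of the data, so it can be checked before relying on it. The following is a minimal sketch of such a check, again assuming Python's sqlite3 as a stand-in for MySQL; the variable name `violations` is only illustrative.

```python
import sqlite3

# Sample CARS table from the question (SQLite used as a stand-in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (car_id INTEGER PRIMARY KEY, owner_id INTEGER, "
    "owner_name TEXT, color TEXT)"
)
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?, ?)",
    [
        (1, 1, "John", "White"),
        (2, 1, "John", "Black"),
        (3, 2, "Mike", "White"),
        (4, 2, "Mike", "Black"),
        (5, 2, "Mike", "Brown"),
        (6, 3, "Tony", "White"),
    ],
)

# owner_id functionally determines owner_name only if no owner_id
# is associated with more than one distinct owner_name.
violations = conn.execute(
    "SELECT owner_id, COUNT(DISTINCT owner_name) AS names "
    "FROM cars GROUP BY owner_id "
    "HAVING COUNT(DISTINCT owner_name) > 1"
).fetchall()

if violations:
    print("owner_id does not determine owner_name for:", violations)
else:
    print("owner_id functionally determines owner_name in this data")
```

An empty result only says the dependency holds for the rows currently in the table; nothing in this denormalized schema enforces it for future rows.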

Let's take a closer look at that fully working MySQL query:

SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id

Given any owner_id, this returns the same owner_name, so adding owner_name to the GROUP BY clause will not result in more rows being returned. Likewise, wrapping it in an aggregate function such as MAX(owner_name) will not result in fewer rows being returned. The resulting data will be exactly the same. In both cases, the query immediately becomes a legal standard SQL query, because at least all the non-aggregated fields are grouped by. So there are three approaches that produce the same results.
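The three approaches can be compared side by side. This is a sketch under the assumption that SQLite, accessed through Python's sqlite3, is an acceptable stand-in: like MySQL without ONLY_FULL_GROUP_BY, SQLite tolerates a bare non-aggregated column in a grouped query. The dictionary labels are only illustrative.

```python
import sqlite3

# Sample CARS table from the question (SQLite used as a stand-in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (car_id INTEGER PRIMARY KEY, owner_id INTEGER, "
    "owner_name TEXT, color TEXT)"
)
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?, ?)",
    [
        (1, 1, "John", "White"),
        (2, 1, "John", "Black"),
        (3, 2, "Mike", "White"),
        (4, 2, "Mike", "Black"),
        (5, 2, "Mike", "Brown"),
        (6, 3, "Tony", "White"),
    ],
)

queries = {
    # Standard SQL: group by every non-aggregated field.
    "group by both": (
        "SELECT owner_id, owner_name, COUNT(*) AS total FROM cars "
        "GROUP BY owner_id, owner_name ORDER BY owner_id"
    ),
    # Standard SQL: aggregate the field that is not grouped.
    "aggregate the name": (
        "SELECT owner_id, MAX(owner_name) AS owner_name, COUNT(*) AS total "
        "FROM cars GROUP BY owner_id ORDER BY owner_id"
    ),
    # Non-standard: bare owner_name, safe here only because
    # owner_id functionally determines owner_name in this data.
    "bare column": (
        "SELECT owner_id, owner_name, COUNT(*) AS total FROM cars "
        "GROUP BY owner_id ORDER BY owner_id"
    ),
}

results = {label: conn.execute(sql).fetchall() for label, sql in queries.items()}
for label, rows in results.items():
    print(label, rows)
```

All three lists are identical on this data. The third query stays correct only as long as the functional dependency holds.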

However, as I mentioned before, this non-standard grouping has a performance advantage. The MySQL documentation on its GROUP BY extension explains this in more detail, but I'm going to cite the most important part:

You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. [...] The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.

One thing that is worth mentioning is that the results are not necessarily wrong but rather indeterminate. In other words, getting the expected results does not mean you have written the right query. Writing the right query will always give you the expected results.
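For a case where removing a non-aggregated field does change the result, the color query from earlier works well. The sketch below assumes SQLite (via Python's sqlite3) as a stand-in for MySQL; both engines may return any owner of the group for the bare column, so the code deliberately checks only that the returned name is one of the candidates and does not predict which one.

```python
import sqlite3

# Sample CARS table from the question (SQLite used as a stand-in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (car_id INTEGER PRIMARY KEY, owner_id INTEGER, "
    "owner_name TEXT, color TEXT)"
)
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?, ?)",
    [
        (1, 1, "John", "White"),
        (2, 1, "John", "Black"),
        (3, 2, "Mike", "White"),
        (4, 2, "Mike", "Black"),
        (5, 2, "Mike", "Brown"),
        (6, 3, "Tony", "White"),
    ],
)

# Grouping by every non-aggregated field: one row per (color, owner) pair.
full = conn.execute(
    "SELECT color, owner_name, COUNT(*) AS total FROM cars "
    "GROUP BY color, owner_name ORDER BY color, owner_name"
).fetchall()

# Dropping owner_name from GROUP BY: one row per color, and the
# owner_name shown is whichever row of the group the engine picked.
partial = conn.execute(
    "SELECT color, owner_name, COUNT(*) AS total FROM cars "
    "GROUP BY color ORDER BY color"
).fetchall()

# Collect the owners that could legitimately appear for each color.
owners_by_color = {}
for color, owner, _ in full:
    owners_by_color.setdefault(color, set()).add(owner)

print(len(full), "rows when grouping by color and owner_name")
print(len(partial), "rows when grouping by color only")
for color, owner, total in partial:
    candidates = sorted(owners_by_color[color])
    print(f"{color}: got {owner!r}, could have been any of {candidates}")
```

Here color does not functionally determine owner_name, so the two queries differ both in row count and in meaning: the second one reports a name that carries no information about the group.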

As you can see, it might be worth taking advantage of this MySQL extension to the GROUP BY clause. Anyway, if this is not 100% clear yet, there is a rule of thumb that will make sure your grouping is always correct: always group by at least all the non-aggregated fields in the select clause. You might be wasting a few CPU cycles in certain situations, but that is better than returning indeterminate results. If you're still terrified about not grouping correctly, then enabling the ONLY_FULL_GROUP_BY SQL mode (part of the default SQL mode since MySQL 5.7.5), which makes MySQL reject such queries, could be a last resort :)

May your grouping be correct and performant... or at least correct.
