Use something like TOP with GROUP BY

Question

Given a table table1 like the one below:

+--------+-------+-------+------------+-------+
| flight |  orig |  dest |  passenger |  bags |
+--------+-------+-------+------------+-------+
|   1111 |  sfo  |  chi  |  david     |     3 |
|   1112 |  sfo  |  dal  |  david     |     7 |
|   1112 |  sfo  |  dal  |  kim       |    10 |
|   1113 |  lax  |  san  |  ameera    |     5 |
|   1114 |  lax  |  lfr  |  tim       |     6 |
|   1114 |  lax  |  lfr  |  jake      |     8 |
+--------+-------+-------+------------+-------+

I'm aggregating the table by orig like this:

select 
  orig
  , count(*) flight_cnt
  , count(distinct passenger) as pass_cnt
  , percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from table1
group by orig

I need to add the passenger with the longest name ( length(passenger) ) for each orig group - how do I go about it?

Expected output:

+------+-------------+-----------+---------------+-------------------+
| orig |  flight_cnt |  pass_cnt |  bags_cnt_med | pass_max_len_name |
+------+-------------+-----------+---------------+-------------------+
| sfo  |           3 |         2 |             7 |  david            |
| lax  |           3 |         3 |             6 | ameera            |
+------+-------------+-----------+---------------+-------------------+

Solution

You can conveniently retrieve the passenger with the longest name per group with DISTINCT ON.

But I see no way to combine that (or any other simple way) with your original query in a single SELECT. I suggest joining two separate subqueries:

SELECT *
FROM  (  -- your original query
   SELECT orig
        , count(*) AS flight_cnt
        , count(distinct passenger) AS pass_cnt
        , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
   FROM   table1
   GROUP  BY orig
   ) org_query
JOIN  (  -- my addition
   SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
   FROM   table1
   ORDER  BY orig, length(passenger) DESC NULLS LAST
   ) pas USING (orig);

USING in the join clause conveniently only outputs one instance of orig, so you can simply use SELECT * in the outer SELECT.
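To try the query above, the sample data from the question could be loaded roughly as follows. This is only a sketch: the column types are assumptions, since the question does not state them.

```sql
-- Assumed column types; the question only shows the data, not the DDL.
CREATE TABLE table1 (
   flight    int
 , orig      text
 , dest      text
 , passenger text
 , bags      int
);

-- The six sample rows from the question.
INSERT INTO table1 (flight, orig, dest, passenger, bags) VALUES
   (1111, 'sfo', 'chi', 'david' ,  3)
 , (1112, 'sfo', 'dal', 'david' ,  7)
 , (1112, 'sfo', 'dal', 'kim'   , 10)
 , (1113, 'lax', 'san', 'ameera',  5)
 , (1114, 'lax', 'lfr', 'tim'   ,  6)
 , (1114, 'lax', 'lfr', 'jake'  ,  8);
```

With this data, the joined query should reproduce the two rows of the expected output from the question.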

If passenger can be NULL, it is important to add NULLS LAST, because descending order sorts NULL values first by default.

From multiple passenger names with the same maximum length in the same group, you get an arbitrary pick, unless you add more expressions to ORDER BY as a tiebreaker. Detailed explanation in the answer linked above.
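As an illustration of such a tiebreaker, the second subquery could sort by the name itself after the length. Choosing the alphabetically first name among equally long ones is an assumption made for this sketch; the question does not specify a rule.

```sql
SELECT DISTINCT ON (orig)
       orig, passenger AS pass_max_len_name
FROM   table1
ORDER  BY orig
        , length(passenger) DESC NULLS LAST  -- longest name first
        , passenger;                         -- tiebreaker: alphabetical order among equal lengths
```

The leading ORDER BY expressions must still match the DISTINCT ON expressions; any additional expressions only decide which row wins within each group.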

Performance?

Typically, a single scan is superior, especially with sequential scans.

The above query uses two scans (possibly index or index-only scans). The second scan is comparatively cheap, though, unless the table is too big to fit (mostly) in cache. Lukas suggested an alternative query with only a single SELECT, adding:

, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1]  -- I'd add NULLS LAST

The idea is smart, but last time I tested, array_agg with ORDER BY did not perform so well. (The overhead of per-group ORDER BY is substantial, and array handling is expensive, too.)
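For reference, this is how that expression would fit into the original aggregate query. It is a sketch of the single-SELECT variant, with NULLS LAST added and the output alias taken from the expected output in the question.

```sql
SELECT orig
     , count(*) AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
       -- sort the names per group by length, then keep only the first array element
     , (array_agg(passenger ORDER BY length(passenger) DESC NULLS LAST))[1] AS pass_max_len_name
FROM   table1
GROUP  BY orig;
```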

The same approach can be cheaper with a custom aggregate function first(), as described in the Postgres Wiki. Or, faster yet, with a version written in C, available on PGXN. This eliminates the extra cost of array handling, but we still need the per-group ORDER BY. It may be faster for only a few groups. You would then add:

 , first(passenger ORDER BY length(passenger) DESC NULLS LAST)
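A minimal sketch of such an aggregate in plain SQL, along the lines of the Postgres Wiki version, could look like this. The helper name first_agg is an assumption here; check the wiki page for the current definition before relying on it.

```sql
-- Transition function: always keep the state, i.e. the first value seen.
-- Being STRICT, it is not called for NULL input, so NULL values are skipped.
CREATE FUNCTION first_agg(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT AS
'SELECT $1';

-- No initial condition: the first non-null input becomes the state.
CREATE AGGREGATE first(anyelement) (
   SFUNC = first_agg
 , STYPE = anyelement
);
```

Once created, first() accepts an ORDER BY inside the call like any other aggregate function.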

Gordon and Lukas also mention the window function first_value(). Window functions are applied after aggregate functions. To use it in the same SELECT, we would need to aggregate passenger somehow first: a catch-22. Gordon solves this with a subquery, another candidate for good performance with standard Postgres.
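A subquery variant with first_value() might look roughly like the following. This is a sketch of the general idea, not necessarily the exact query Gordon posted.

```sql
SELECT orig
     , count(*) AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
       -- the window result is the same on every row of a group, so any aggregate collapses it
     , max(pass_max_len_name) AS pass_max_len_name
FROM  (
   SELECT *
        , first_value(passenger) OVER (PARTITION BY orig
                                       ORDER BY length(passenger) DESC NULLS LAST) AS pass_max_len_name
   FROM   table1
   ) sub
GROUP  BY orig;
```

The window function runs in the subquery, before the outer aggregation, which sidesteps the ordering problem described above.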

first() does the same without a subquery and should be simpler and a bit faster. But it still won't be faster than a separate DISTINCT ON for most cases with few rows per group. For lots of rows per group, a recursive CTE technique is typically faster. There are yet faster techniques if you have a separate table holding all relevant, unique orig values.
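One such technique with a separate table is a LATERAL join that fetches a single row per orig. In this sketch the table name origs and the index are assumptions introduced for illustration only.

```sql
-- Hypothetical lookup table holding each relevant orig exactly once.
-- CREATE TABLE origs (orig text PRIMARY KEY);

-- An expression index like this one might let each lookup read just one index entry.
CREATE INDEX ON table1 (orig, (length(passenger)) DESC NULLS LAST);

SELECT o.orig, p.pass_max_len_name
FROM   origs o
CROSS  JOIN LATERAL (
   SELECT t.passenger AS pass_max_len_name
   FROM   table1 t
   WHERE  t.orig = o.orig
   ORDER  BY length(t.passenger) DESC NULLS LAST
   LIMIT  1                                   -- only the longest name per orig
   ) p;
```

Note that CROSS JOIN LATERAL drops orig values without any matching row in table1; LEFT JOIN LATERAL ... ON true would keep them.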

The best solution depends on various factors. The proof of the pudding is in the eating. To optimize performance you have to test with your setup. The above query should be among the fastest.
