Use something like TOP with GROUP BY
Problem description
Given a table table1 like the one below:
+--------+------+------+-----------+------+
| flight | orig | dest | passenger | bags |
+--------+------+------+-----------+------+
| 1111   | sfo  | chi  | david     | 3    |
| 1112   | sfo  | dal  | david     | 7    |
| 1112   | sfo  | dal  | kim       | 10   |
| 1113   | lax  | san  | ameera    | 5    |
| 1114   | lax  | lfr  | tim       | 6    |
| 1114   | lax  | lfr  | jake      | 8    |
+--------+------+------+-----------+------+
I'm aggregating the table by orig as shown below:
select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from table1
group by orig
I need to add the passenger with the longest name (length(passenger)) for each orig group. How do I go about it?
Expected output:
+------+------------+----------+-------------+-------------------+
| orig | flight_cnt | pass_cnt | bag_cnt_med | pass_max_len_name |
+------+------------+----------+-------------+-------------------+
| sfo  | 3          | 2        | 7           | david             |
| lax  | 3          | 3        | 6           | ameera            |
+------+------------+----------+-------------+-------------------+
Solution
You can conveniently retrieve the passenger with the longest name per group with DISTINCT ON. But I see no way to combine that (or any other simple way) with your original query in a single SELECT. I suggest joining two separate subqueries:
SELECT *
FROM ( -- your original query
SELECT orig
, count(*) AS flight_cnt
, count(distinct passenger) AS pass_cnt
, percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
FROM table1
GROUP BY orig
) org_query
JOIN ( -- my addition
SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
FROM table1
ORDER BY orig, length(passenger) DESC NULLS LAST
) pas USING (orig);
USING in the join clause conveniently outputs only one instance of orig, so you can simply use SELECT * in the outer SELECT.
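If passenger can be NULL, it is important to add NULLS LAST. In descending order, NULL values sort first by default, so without it a NULL passenger would be picked ahead of any actual name.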
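To illustrate the default sort position of NULL, here is a small self-contained sketch. The inline VALUES list is made up for the demonstration and is not part of table1.

```sql
-- Hypothetical inline data, only to show where NULL sorts.
SELECT passenger
FROM  (VALUES ('ameera'), (NULL), ('tim')) AS v(passenger)
ORDER  BY length(passenger) DESC;             -- the NULL row sorts first

SELECT passenger
FROM  (VALUES ('ameera'), (NULL), ('tim')) AS v(passenger)
ORDER  BY length(passenger) DESC NULLS LAST;  -- 'ameera' sorts first, NULL last
```

Combined with DISTINCT ON, which keeps the first row per group, the first variant would return NULL for any group containing a NULL passenger.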
Among multiple passenger names with the same maximum length in the same group, you get an arbitrary pick, unless you add more expressions to ORDER BY as a tiebreaker. Detailed explanation in the answer linked above.
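As a sketch of such a tiebreaker, the DISTINCT ON subquery could sort by the name itself as a last criterion. Choosing alphabetical order is an assumption made here for illustration; any deterministic expression would do.

```sql
-- Same subquery as above, with passenger appended as a tiebreaker:
-- among names of equal maximum length, the alphabetically first one wins.
SELECT DISTINCT ON (orig)
       orig, passenger AS pass_max_len_name
FROM   table1
ORDER  BY orig, length(passenger) DESC NULLS LAST, passenger;
```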
Performance?
Typically, a single scan is superior, especially with sequential scans.
The above query uses two scans (maybe index or index-only scans). But the second scan is comparatively cheap unless the table is too big to fit (mostly) in cache. Lukas suggested an alternative query with only a single SELECT, adding:
, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] -- I'd add NULLS LAST
The idea is smart, but last time I tested, array_agg with ORDER BY did not perform so well. (The overhead of per-group ORDER BY is substantial, and array handling is expensive, too.)
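For reference, the complete single-SELECT variant might look like the sketch below. It simply places the array_agg expression into the original aggregate query, with NULLS LAST added as suggested; the alias pass_max_len_name is taken from the expected output.

```sql
SELECT orig
     , count(*)                  AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
       -- sort each group's names by length, keep only the first array element
     , (array_agg(passenger ORDER BY length(passenger) DESC NULLS LAST))[1] AS pass_max_len_name
FROM   table1
GROUP  BY orig;
```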
The same approach can be cheaper with a custom aggregate function first(), as described in the Postgres Wiki. Or, faster yet, with a version written in C, available on PGXN. That eliminates the extra cost for array handling, but we still need the per-group ORDER BY. It may be faster with only a few groups. You would then add:
, first(passenger ORDER BY length(passenger) DESC NULLS LAST)
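A minimal sketch of this variant follows. The aggregate definition is written along the lines of the one on the Postgres Wiki; treat it as an assumption and check the wiki for the current version. The PARALLEL options assume Postgres 9.6 or later and can be dropped on older versions.

```sql
-- Transition function: keep the first value seen.
-- STRICT makes the aggregate skip NULL input and return the first non-null value.
CREATE OR REPLACE FUNCTION public.first_agg (anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT $1';

CREATE AGGREGATE public.first (anyelement) (
  SFUNC    = public.first_agg
, STYPE    = anyelement
, PARALLEL = safe
);

-- The original query with the extra column added.
SELECT orig
     , count(*)                  AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
     , first(passenger ORDER BY length(passenger) DESC NULLS LAST) AS pass_max_len_name
FROM   table1
GROUP  BY orig;
```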
Gordon and Lukas also mention the window function first_value(). Window functions are applied after aggregate functions. To use it in the same SELECT, we would need to aggregate passenger somehow first: a catch-22. Gordon solves this with a subquery, which is another candidate for good performance with standard Postgres.
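Gordon's query is not quoted here, so the following is only a sketch of how the subquery approach could look, not necessarily his exact formulation. The window function runs in the subquery, and the outer query aggregates its result. Since first_value() returns the same value for every row of a partition, min() is just a way to collapse it to one value per group.

```sql
SELECT orig
     , count(*)                  AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
     , min(pass_max_len_name)    AS pass_max_len_name  -- identical within each group
FROM  (
   SELECT *
        , first_value(passenger) OVER (PARTITION BY orig
                                       ORDER BY length(passenger) DESC NULLS LAST) AS pass_max_len_name
   FROM   table1
   ) sub
GROUP  BY orig;
```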
first() does the same without a subquery and should be simpler and a bit faster. But it still won't be faster than a separate DISTINCT ON for most cases with few rows per group. For lots of rows per group, a recursive CTE technique is typically faster. There are yet faster techniques if you have a separate table holding all relevant, unique orig values.
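As one example of a technique that relies on a separate table of unique orig values, a LATERAL join can fetch the top row per group. This is a sketch under assumptions: the table origins(orig) is hypothetical, and this may not be the exact technique the answer has in mind.

```sql
-- origins is a hypothetical table with one row per distinct orig value.
SELECT o.orig, p.pass_max_len_name
FROM   origins o
CROSS  JOIN LATERAL (
   SELECT t.passenger AS pass_max_len_name
   FROM   table1 t
   WHERE  t.orig = o.orig
   ORDER  BY length(t.passenger) DESC NULLS LAST
   LIMIT  1                      -- only the longest name per orig
   ) p;
```

The result could replace the DISTINCT ON subquery in the join above. An index matching the subquery's WHERE and ORDER BY, for instance on (orig, length(passenger) DESC NULLS LAST), would likely be needed for this to pay off; test with your data.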
The best solution depends on various factors. The proof of the pudding is in the eating. To optimize performance you have to test with your setup. The above query should be among the fastest.