Efficient GROUP BY a CASE expression in Amazon Redshift/PostgreSQL


Problem description



In analytics processing there is often a need to collapse "unimportant" groups of data into a single row in the resulting table. One way to do this is to GROUP BY a CASE expression where unimportant groups are coalesced into a single row via the CASE expression returning a single value, e.g., NULL, for those groups. This question is about efficient ways to perform this grouping in Amazon Redshift, which is based on ParAccel and is close to PostgreSQL 8.0 in terms of functionality.

As an example, consider a GROUP BY on type and url in a table where each row is a single URL visit. The goal is to perform aggregation such that one row is emitted for every (type, url) pair where the URL visit count exceeds a certain threshold and one (type, NULL) row is emitted for all (type, url) pairs where the visit count is under that threshold. The rest of the columns in the result table would have SUM/COUNT aggregates based on this grouping.

For example, the following data

+------+----------------------+-----------------------+
| type | url                  | < 50+ other columns > |
+------+----------------------+-----------------------+
|  A   | http://popular.com   |                       |
|  A   | http://popular.com   |                       |
|  A   | < 9997 more times>   |                       |
|  A   | http://popular.com   |                       |
|  A   | http://small-one.com |                       |
|  B   | http://tiny.com      |                       |
|  B   | http://tiny-too.com  |                       |

should produce the following result table with a threshold of 10,000

+------+----------------------+-------------+--------------------------+
| type | url                  | visit_count | < SUM/COUNT aggregates > |
+------+----------------------+-------------+--------------------------+
|  A   | http://popular.com   |       10000 |                          |
|  A   |                      |           1 |                          |
|  B   |                      |           2 |                          |
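
For reference in the queries that follow, here is a minimal sketch of the assumed shape of the source table. Only type and url appear in the question, so the extra column (bytes_sent) is a hypothetical stand-in for the 50+ other columns:

-- Minimal sketch of the assumed source table; bytes_sent is hypothetical,
-- standing in for the 50+ other columns that get SUM/COUNT aggregates
CREATE TABLE t (
    type       varchar(16),
    url        varchar(2048),
    bytes_sent bigint
    -- ... additional metric/dimension columns
);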

Summary:

Amazon Redshift has certain subquery correlation limitations one needs to tip-toe around. Gordon Linoff's answer below (the accepted answer) shows how to GROUP BY a CASE expression using double aggregation, replicating the expression in both the result column and the outer GROUP BY clause.

with temp_counts as (SELECT type, url, COUNT(*) as cnt FROM t GROUP BY type, url)
select type, (case when cnt >= 10000 then url end) as url, sum(cnt) as cnt
from temp_counts
group by type, (case when cnt >= 10000 then url end)
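
The additional SUM/COUNT aggregates mentioned in the question carry through both aggregation levels in the same way. A sketch of this, again using the hypothetical bytes_sent column:

with temp_counts as (
    SELECT type, url, COUNT(*) as cnt,
           SUM(bytes_sent) as total_bytes   -- bytes_sent is hypothetical
    FROM t
    GROUP BY type, url
)
select type,
       (case when cnt >= 10000 then url end) as url,
       sum(cnt) as cnt,                      -- per-group visit count
       sum(total_bytes) as total_bytes       -- extra aggregate carried through
from temp_counts
group by type, (case when cnt >= 10000 then url end)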

Further testing indicated that the double aggregation can be "unrolled" into a UNION ALL of independent queries, one per branch of the CASE expression. In this particular case, on a sample data set of approximately 200M rows, this approach consistently performed about 30% faster. That result is schema- and data-specific, however.

with temp_counts as (SELECT type, url, COUNT(*) as cnt FROM t GROUP BY type, url)
select * from temp_counts WHERE cnt >= 10000
UNION ALL
SELECT type, NULL as url, SUM(cnt) as cnt from temp_counts 
WHERE cnt < 10000 
GROUP BY type

This suggests two general patterns for implementing and optimizing arbitrary disjoint grouping & summarization in Amazon Redshift. If performance is important to you, benchmark both.
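
One simple way to compare them, assuming access to Redshift's STL_QUERY system log, is to run each variant a few times and read the elapsed times back afterwards (the LIKE filter is only an illustrative way of locating the test queries):

-- Rough timing comparison from the Redshift query log
SELECT query,
       trim(querytxt) AS sql_text,
       datediff(ms, starttime, endtime) AS elapsed_ms
FROM stl_query
WHERE querytxt LIKE '%temp_counts%'
ORDER BY starttime DESC
LIMIT 20;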

Solution

You would do this with two aggregations:

select type, (case when cnt > XXX then url end) as url, sum(cnt) as visit_cnt
from (select type, url, count(*) as cnt
      from t
      group by type, url
     ) t
group by type, (case when cnt > XXX then url end)
order by type, sum(cnt) desc;
