Spark SQL top n per group


Problem Description

How can I get the top-n (let's say top 10 or top 3) per group in spark-sql?

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ provides a tutorial for general SQL. However, Spark does not implement subqueries in the WHERE clause.

Solution

You can use the window function feature that was added in Spark 1.4. Suppose that we have a productRevenue table as shown below.
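The original answer displays the contents of productRevenue as an image, which is not reproduced here. As a stand-in, the snippet below builds a small productRevenue temporary view with made-up product, category, and revenue values (the column names are taken from the query that follows; the CREATE TEMPORARY VIEW ... VALUES syntax assumes a more recent Spark than 1.4, where you would instead register a temp table from a DataFrame):

-- Hypothetical sample data: the real table was shown as an image in the
-- original answer, so these rows are illustrative only.
CREATE OR REPLACE TEMPORARY VIEW productRevenue AS
SELECT * FROM VALUES
  ('Thin',       'Cell phone', 6000),
  ('Normal',     'Tablet',     1500),
  ('Mini',       'Tablet',     5500),
  ('Ultra thin', 'Cell phone', 5000),
  ('Very thin',  'Cell phone', 6000),
  ('Big',        'Tablet',     2500),
  ('Bendable',   'Cell phone', 3000),
  ('Foldable',   'Cell phone', 3000),
  ('Pro',        'Tablet',     4500),
  ('Pro2',       'Tablet',     6500)
AS t(product, category, revenue);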

The answer to "What are the best-selling and the second best-selling products in every category?" is as follows:

SELECT product, category, revenue
FROM (
  SELECT product, category, revenue,
         dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank
  FROM productRevenue) tmp
WHERE rank <= 2

This will give you the desired result.
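One detail worth knowing, since it affects the row count: dense_rank() keeps ties, so a category in which several products share one of the top two revenue values will return more than two rows. If you want at most n rows per group even with ties, the same query works with row_number() instead; a minimal variation under that assumption:

SELECT product, category, revenue
FROM (
  SELECT product, category, revenue,
         row_number() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
  FROM productRevenue) tmp
WHERE rn <= 2

rank() is the third option: it skips positions after a tie, so two products tied for first in a category would be the only rows returned for rank <= 2.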
