How to select the N highest values for each category in Spark Scala


Question

Say I have this dataset:

  val main_df = Seq(
    ("yankees-mets", 8, 20), ("yankees-redsox", 4, 14), ("yankees-mets", 6, 17),
    ("yankees-redsox", 2, 10), ("yankees-mets", 5, 17), ("yankees-redsox", 5, 10)
  ).toDF("teams", "homeruns", "hits")

which looks like this:

+--------------+--------+----+
|         teams|homeruns|hits|
+--------------+--------+----+
|  yankees-mets|       8|  20|
|yankees-redsox|       4|  14|
|  yankees-mets|       6|  17|
|yankees-redsox|       2|  10|
|  yankees-mets|       5|  17|
|yankees-redsox|       5|  10|
+--------------+--------+----+

I want to pivot on the teams column, and for all the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns, it would return this,

+------------+--------+
|       teams|homeruns|
+------------+--------+
|yankees-mets|       8|
|yankees-mets|       6|
+------------+--------+

since the 2 highest homerun totals for them were 8 and 6.

How would I do this in the general case?

Thanks

Answer

Since

A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns,

a pivot is not really what you want here: it would collapse each group into aggregated values rather than keep whole rows.
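To see the difference concretely, here is a minimal sketch of an actual pivot over this data (aggregating with max purely for illustration); it collapses each team to a single value instead of keeping the top rows:

// A pivot aggregates: one output column per distinct team,
// each holding a single aggregated value, not individual rows.
main_df.groupBy().pivot("teams").max("homeruns").show

// expected output (values inferred from the sample data):
// +------------+--------------+
// |yankees-mets|yankees-redsox|
// +------------+--------------+
// |           8|             5|
// +------------+--------------+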

You could instead create an additional rank column with a window function and then select only the rows with rank 1 or 2:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

main_df
  .withColumn(
    "rank",
    // rank rows within each team, highest homeruns first
    rank().over(
      Window.partitionBy("teams")
        .orderBy($"homeruns".desc)
    )
  )
  // keep only the top 2 ranked rows for yankees-mets
  .where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
  .show

+------------+--------+----+----+
|       teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets|       8|  20|   1|
|yankees-mets|       6|  17|   2|
+------------+--------+----+----+

Then, if you no longer need the rank column, you can simply drop it with .drop("rank").
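For the general case, the same pattern parameterizes cleanly over the category column, the ordering column, and N. A minimal sketch, assuming a helper name of my own (topNPerCategory is not a Spark API); it uses row_number instead of rank so ties can never produce more than n rows per group:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Return the n rows with the highest values of orderCol within each
// group of categoryCol, without the temporary ranking column.
def topNPerCategory(df: DataFrame, categoryCol: String, orderCol: String, n: Int): DataFrame = {
  val w = Window.partitionBy(categoryCol).orderBy(col(orderCol).desc)
  df.withColumn("rn", row_number().over(w))
    .where(col("rn") <= n)
    .drop("rn")
}

// e.g. the top 2 homerun rows for every team at once:
topNPerCategory(main_df, "teams", "homeruns", 2).show

If you instead want ties to be kept (possibly yielding more than n rows per group), swap row_number for rank or dense_rank.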

