如何最大化值并保留所有列(每个组的最大记录数)? [英] How to max value and keep all columns (for max records per group)?

查看:73
本文介绍了如何最大化值并保留所有列(每个组的最大记录数)?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出以下数据框:

+----+-----+---+-----+
| uid|    k|  v|count|
+----+-----+---+-----+
|   a|pref1|  b|  168|
|   a|pref3|  h|  168|
|   a|pref3|  t|   63|
|   a|pref3|  k|   84|
|   a|pref1|  e|   84|
|   a|pref2|  z|  105|
+----+-----+---+-----+

如何从uidk中获取最大值但包含v?

How can I get the max value from uid, k but include v?

+----+-----+---+----------+
| uid|    k|  v|max(count)|
+----+-----+---+----------+
|   a|pref1|  b|       168|
|   a|pref3|  h|       168|
|   a|pref2|  z|       105|
+----+-----+---+----------+

我可以做这样的事情,但是它将删除"v"列:

I can do something like this but it will drop the column "v" :

df.groupBy("uid", "k").max("count")

推荐答案

这是窗口运算符(使用over函数)或join的完美示例.

It's the perfect example for window operators (using over function) or join.

由于您已经了解了如何使用Windows,因此我仅关注join.

Since you've already figured out how to use windows, I focus on join exclusively.

scala> val inventory = Seq(
     |   ("a", "pref1", "b", 168),
     |   ("a", "pref3", "h", 168),
     |   ("a", "pref3", "t",  63)).toDF("uid", "k", "v", "count")
inventory: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 2 more fields]

scala> val maxCount = inventory.groupBy("uid", "k").max("count")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]

scala> maxCount.show
+---+-----+----------+
|uid|    k|max(count)|
+---+-----+----------+
|  a|pref3|       168|
|  a|pref1|       168|
+---+-----+----------+

scala> val maxCount = inventory.groupBy("uid", "k").agg(max("count") as "max")
maxCount: org.apache.spark.sql.DataFrame = [uid: string, k: string ... 1 more field]

scala> maxCount.show
+---+-----+---+
|uid|    k|max|
+---+-----+---+
|  a|pref3|168|
|  a|pref1|168|
+---+-----+---+

scala> maxCount.join(inventory, Seq("uid", "k")).where($"max" === $"count").show
+---+-----+---+---+-----+
|uid|    k|max|  v|count|
+---+-----+---+---+-----+
|  a|pref3|168|  h|  168|
|  a|pref1|168|  b|  168|
+---+-----+---+---+-----+

这篇关于如何最大化值并保留所有列(每个组的最大记录数)?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆