SPARK DataFrame: select the first row of each group
Question
I have a DataFrame generated as follows:
df.groupBy($"Hour", $"Category")
  .agg(sum($"value").alias("TotalValue"))
  .sort($"Hour".asc, $"TotalValue".desc)
The results look like:
+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
| 0| cat26| 30.9|
| 0| cat13| 22.1|
| 0| cat95| 19.6|
| 0| cat105| 1.3|
| 1| cat67| 28.5|
| 1| cat4| 26.8|
| 1| cat13| 12.6|
| 1| cat23| 5.3|
| 2| cat56| 39.6|
| 2| cat40| 29.7|
| 2| cat187| 27.9|
| 2| cat68| 9.8|
| 3| cat8| 35.6|
| ...| ....| ....|
+----+--------+----------+
As you can see, the DataFrame is ordered by Hour in increasing order, and then by TotalValue in descending order.
I would like to select the top row of each group, i.e.
- (0, cat26, 30.9) from the group Hour == 0
- (1, cat67, 28.5) from the group Hour == 1
- (2, cat56, 39.6) from the group Hour == 2
- and so on
So the desired output would be:
+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
| 0| cat26| 30.9|
| 1| cat67| 28.5|
| 2| cat56| 39.6|
| 3| cat8| 35.6|
| ...| ...| ...|
+----+--------+----------+
It might be handy to be able to select the top N rows of each group as well.
Any help is highly appreciated.
Answer
Window functions
Something like this should do the trick:
import org.apache.spark.sql.functions.{row_number, max, broadcast, first, struct}
import org.apache.spark.sql.expressions.Window
val df = sc.parallelize(Seq(
(0,"cat26",30.9), (0,"cat13",22.1), (0,"cat95",19.6), (0,"cat105",1.3),
(1,"cat67",28.5), (1,"cat4",26.8), (1,"cat13",12.6), (1,"cat23",5.3),
(2,"cat56",39.6), (2,"cat40",29.7), (2,"cat187",27.9), (2,"cat68",9.8),
(3,"cat8",35.6))).toDF("Hour", "Category", "TotalValue")
val w = Window.partitionBy($"Hour").orderBy($"TotalValue".desc)

val dfTop = df.withColumn("rn", row_number().over(w)).where($"rn" === 1).drop("rn")
dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// | 0| cat26| 30.9|
// | 1| cat67| 28.5|
// | 2| cat56| 39.6|
// | 3| cat8| 35.6|
// +----+--------+----------+
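Since the question also asks for the top N rows of each group, the same window handles that by relaxing the filter; a minimal sketch, with n as a hypothetical parameter:

val n = 3 // hypothetical: how many rows to keep per group
val dfTopN = df
  .withColumn("rn", row_number().over(w))
  .where($"rn" <= n)
  .drop("rn")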
Plain SQL aggregation followed by join
Alternatively, you can join with the aggregated data frame:
val dfMax = df.groupBy($"Hour".as("max_hour")).agg(max($"TotalValue").as("max_value"))

val dfTopByJoin = df.join(broadcast(dfMax),
    ($"Hour" === $"max_hour") && ($"TotalValue" === $"max_value"))
  .drop("max_hour")
  .drop("max_value")
dfTopByJoin.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// | 0| cat26| 30.9|
// | 1| cat67| 28.5|
// | 2| cat56| 39.6|
// | 3| cat8| 35.6|
// +----+--------+----------+
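Since the heading mentions plain SQL, the same aggregate-and-join can also be written as a raw query; a sketch assuming Spark 2.0+'s SparkSession and createOrReplaceTempView (use registerTempTable on 1.x), with "df" as a hypothetical view name:

df.createOrReplaceTempView("df") // hypothetical view name
spark.sql("""
  SELECT d.Hour, d.Category, d.TotalValue
  FROM df d
  JOIN (
    SELECT Hour, max(TotalValue) AS max_value
    FROM df
    GROUP BY Hour
  ) m ON d.Hour = m.Hour AND d.TotalValue = m.max_value
""").show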
Note that this will keep duplicate values (if there is more than one category per hour with the same total value). You can remove these as follows:
dfTopByJoin
  .groupBy($"Hour")
  .agg(
    first("Category").alias("Category"),
    first("TotalValue").alias("TotalValue"))
Using ordering over structs
A neat, although not very well tested, trick which requires neither joins nor window functions: it relies on the fact that Spark compares structs field by field, so the maximum of (TotalValue, Category) structs within a group belongs to the row with the largest TotalValue.
val dfTop = df.select($"Hour", struct($"TotalValue", $"Category").alias("vs"))
  .groupBy($"Hour")
  .agg(max("vs").alias("vs"))
  .select($"Hour", $"vs.Category", $"vs.TotalValue")
dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// | 0| cat26| 30.9|
// | 1| cat67| 28.5|
// | 2| cat56| 39.6|
// | 3| cat8| 35.6|
// +----+--------+----------+
With the Dataset API (Spark 1.6+, 2.0+):
case class Record(Hour: Integer, Category: String, TotalValue: Double)

df.as[Record]
  .groupBy($"Hour")
  .reduce((x, y) => if (x.TotalValue > y.TotalValue) x else y)
  .show
// +---+--------------+
// | _1| _2|
// +---+--------------+
// |[0]|[0,cat26,30.9]|
// |[1]|[1,cat67,28.5]|
// |[2]|[2,cat56,39.6]|
// |[3]| [3,cat8,35.6]|
// +---+--------------+
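The snippet above uses the Spark 1.6 typed API; in Spark 2.0+ the typed grouping method was renamed, so a sketch of the equivalent would be:

// Spark 2.0 or later
df.as[Record]
  .groupByKey(_.Hour)
  .reduceGroups((x, y) => if (x.TotalValue > y.TotalValue) x else y)
  .show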