Find the top n elements for attribute combination in data frame spark
Question
I have a data frame like the one below.
scala> ds.show
+----+----------+----------+-----+
| key|attribute1|attribute2|value|
+----+----------+----------+-----+
|mac1|        A1|        B1|   10|
|mac2|        A2|        B1|   10|
|mac3|        A2|        B1|   10|
|mac1|        A1|        B2|   10|
|mac1|        A1|        B2|   10|
|mac3|        A1|        B1|   10|
|mac2|        A2|        B1|   10|
+----+----------+----------+-----+
For each value in attribute1, I want to find the top N keys and the aggregated value for each of those keys. The aggregated value per key for attribute1 would be:
+----+----------+-----+
| key|attribute1|value|
+----+----------+-----+
|mac1|        A1|   30|
|mac2|        A2|   20|
|mac3|        A2|   10|
|mac3|        A1|   10|
+----+----------+-----+
Now if N = 1, the output would be A1 - (mac1, 30) and A2 - (mac2, 20).
How can I achieve this with a DataFrame/Dataset? I want to do this for all the attributes: in the example above, for attribute2 as well as attribute1.
Recommended answer
Given the input dataframe:
+----+----------+----------+-----+
|key |attribute1|attribute2|value|
+----+----------+----------+-----+
|mac1|A1        |B1        |10   |
|mac2|A2        |B1        |10   |
|mac3|A2        |B1        |10   |
|mac1|A1        |B2        |10   |
|mac1|A1        |B2        |10   |
|mac3|A1        |B1        |10   |
|mac2|A2        |B1        |10   |
+----+----------+----------+-----+
Aggregate the values per key and attribute1 on the input dataframe above as follows:
import org.apache.spark.sql.functions._
val groupeddf = df.groupBy("key", "attribute1").agg(sum("value").as("value"))
This should give you:
+----+----------+-----+
|key |attribute1|value|
+----+----------+-----+
|mac1|A1        |30.0 |
|mac3|A1        |10.0 |
|mac3|A2        |10.0 |
|mac2|A2        |20.0 |
+----+----------+-----+
Now you can use a Window function to generate a rank for each row in the grouped data and keep only the rows with rank <= N:
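import org.apache.spark.sql.expressions.Window
import spark.implicits._ // enables $"..."; assumes a SparkSession named spark (already imported in spark-shell)

val N = 1
// Rank the rows within each attribute1 value by aggregated value, highest first
val windowSpec = Window.partitionBy("attribute1").orderBy($"value".desc)
groupeddf.withColumn("rank", rank().over(windowSpec))
  .filter($"rank" <= N) // rank() gives tied rows the same rank, so ties can yield more than N rows
  .drop("rank")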
This should give you the desired dataframe:
+----+----------+-----+
|key |attribute1|value|
+----+----------+-----+
|mac2|A2        |20.0 |
|mac1|A1        |30.0 |
+----+----------+-----+
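The question also asks for the same result for every attribute, which the steps above only show for attribute1. One possible way to cover that is to wrap the group-then-rank logic in a function and call it once per attribute column. The sketch below is an illustration, not part of the original answer: the helper name topNPerAttribute, the local SparkSession settings, and the integer type of value are all assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, sum}

// Hypothetical helper (name and signature are assumptions): for each distinct
// value of `attribute`, keep the keys whose summed value ranks within the top n.
def topNPerAttribute(df: DataFrame, attribute: String, n: Int): DataFrame = {
  // Step 1: aggregate value per (key, attribute) pair
  val grouped = df.groupBy("key", attribute).agg(sum("value").as("value"))
  // Step 2: rank the aggregated rows within each attribute value, highest first
  val windowSpec = Window.partitionBy(attribute).orderBy(col("value").desc)
  grouped
    .withColumn("rank", rank().over(windowSpec))
    .filter(col("rank") <= n)
    .drop("rank")
}

// Assumed local session for trying the sketch; in spark-shell, `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("top-n-sketch").getOrCreate()
import spark.implicits._

// Same rows as the example input; value is assumed to be an integer column here
val df = Seq(
  ("mac1", "A1", "B1", 10),
  ("mac2", "A2", "B1", 10),
  ("mac3", "A2", "B1", 10),
  ("mac1", "A1", "B2", 10),
  ("mac1", "A1", "B2", 10),
  ("mac3", "A1", "B1", 10),
  ("mac2", "A2", "B1", 10)
).toDF("key", "attribute1", "attribute2", "value")

// One result dataframe per attribute column
val topByAttribute: Map[String, DataFrame] =
  Seq("attribute1", "attribute2").map(a => a -> topNPerAttribute(df, a, 1)).toMap

topByAttribute.foreach { case (_, result) => result.show(false) }
```

With this data and n = 1, attribute2 returns two rows for B1 (mac2 and mac3 both sum to 20), because rank() assigns tied rows the same rank. If exactly n rows per group are required, row_number() can be used instead of rank(), at the cost of an arbitrary choice among tied rows unless a tie-breaking column is added to the orderBy.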