Find the top n elements for attribute combination in data frame spark
Question
I have a data frame like below.
scala> ds.show
+----+----------+----------+-----+
| key|attribute1|attribute2|value|
+----+----------+----------+-----+
|mac1| A1| B1| 10|
|mac2| A2| B1| 10|
|mac3| A2| B1| 10|
|mac1| A1| B2| 10|
|mac1| A1| B2| 10|
|mac3| A1| B1| 10|
|mac2| A2| B1| 10|
+----+----------+----------+-----+
For each value in attribute1, I want to find the top N keys and the aggregated value for each key. Output: the aggregated value per key for attribute1 would be
+----+----------+-----+
| key|attribute1|value|
+----+----------+-----+
|mac1| A1| 30|
|mac2| A2| 20|
|mac3| A2| 10|
|mac3| A1| 10|
+----+----------+-----+
Now if N = 1, the output would be A1 - (mac1, 30) and A2 - (mac2, 20).
How can I achieve this with a DataFrame/Dataset? I want to do this for all the attributes; in the example above, for both attribute1 and attribute2.
Answer
Given the input dataframe as
+----+----------+----------+-----+
|key |attribute1|attribute2|value|
+----+----------+----------+-----+
|mac1|A1 |B1 |10 |
|mac2|A2 |B1 |10 |
|mac3|A2 |B1 |10 |
|mac1|A1 |B2 |10 |
|mac1|A1 |B2 |10 |
|mac3|A1 |B1 |10 |
|mac2|A2 |B1 |10 |
+----+----------+----------+-----+
and performing aggregation on the input dataframe above as
import org.apache.spark.sql.functions._
val groupeddf = df.groupBy("key", "attribute1").agg(sum("value").as("value"))
should give you
+----+----------+-----+
|key |attribute1|value|
+----+----------+-----+
|mac1|A1 |30.0 |
|mac3|A1 |10.0 |
|mac3|A2 |10.0 |
|mac2|A2 |20.0 |
+----+----------+-----+
Now you can use a Window function to generate a rank for each row in the grouped data, and filter the rows with rank <= N:
import org.apache.spark.sql.expressions.Window

val N = 1
// Rank keys within each attribute1 partition by descending aggregated value
val windowSpec = Window.partitionBy("attribute1").orderBy($"value".desc)
groupeddf.withColumn("rank", rank().over(windowSpec))
  .filter($"rank" <= N)
  .drop("rank")
which should give you the desired dataframe.
+----+----------+-----+
|key |attribute1|value|
+----+----------+-----+
|mac2|A2 |20.0 |
|mac1|A1 |30.0 |
+----+----------+-----+
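To cover the second part of the question (doing this for all the attributes, not just attribute1), the same group-rank-filter steps can be wrapped in a helper and applied per attribute column. This is a sketch, not from the original answer: `topNPerAttribute` is a hypothetical helper name, and it assumes each attribute is ranked independently against the `key` and `value` columns shown above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical helper: top-N keys for each value of the given attribute column.
def topNPerAttribute(df: DataFrame, attribute: String, n: Int): DataFrame = {
  // Aggregate value per (key, attribute) pair
  val grouped = df.groupBy("key", attribute).agg(sum("value").as("value"))
  // Rank within each attribute value by descending aggregated value
  val w = Window.partitionBy(attribute).orderBy(col("value").desc)
  grouped.withColumn("rank", rank().over(w))
    .filter(col("rank") <= n)
    .drop("rank")
}

// Apply to every attribute column in turn:
Seq("attribute1", "attribute2").foreach { attr =>
  topNPerAttribute(ds, attr, 1).show()
}
```

Note that `rank()` returns ties with the same rank, so with N = 1 a partition can yield more than one row when keys share the same aggregated value; use `row_number()` instead if exactly one row per attribute value is required.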