Aggregating multiple columns with custom function in Spark
Question
I was wondering if there is some way to specify a custom aggregation function for Spark DataFrames over multiple columns.
I have a table of (name, item, price) rows like this:
john | tomato | 1.99
john | carrot | 0.45
bill | apple | 0.99
john | banana | 1.29
bill | taco | 2.59
I would like to aggregate each person's items and their costs into a list like this:
john | (tomato, 1.99), (carrot, 0.45), (banana, 1.29)
bill | (apple, 0.99), (taco, 2.59)
Is this possible with DataFrames? I recently learned about collect_list, but it appears to only work for one column.
Answer
The easiest way to do this as a DataFrame is to first collect two lists, and then use a UDF to zip the two lists together. Something like:
import org.apache.spark.sql.functions.{col, collect_list, udf}
import sqlContext.implicits._

// UDF that zips the two collected lists into a single list of (food, price) pairs
val zipper = udf[Seq[(String, Double)], Seq[String], Seq[Double]](_.zip(_))

val df = Seq(
  ("john", "tomato", 1.99),
  ("john", "carrot", 0.45),
  ("bill", "apple", 0.99),
  ("john", "banana", 1.29),
  ("bill", "taco", 2.59)
).toDF("name", "food", "price")

// Collect each column into its own list per name, then zip the lists together
val df2 = df.groupBy("name").agg(
  collect_list(col("food")) as "food",
  collect_list(col("price")) as "price"
).withColumn("food", zipper(col("food"), col("price"))).drop("price")
df2.show(false)
# +----+---------------------------------------------+
# |name|food |
# +----+---------------------------------------------+
# |john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
# |bill|[[apple,0.99], [taco,2.59]] |
# +----+---------------------------------------------+
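
As a side note, on Spark 2.0 and later you can usually avoid the UDF entirely by collecting a list of structs with the built-in struct function. A minimal sketch, assuming the same df as above:

import org.apache.spark.sql.functions.{col, collect_list, struct}

// Build a (food, price) struct per row, then collect the structs per name
val df3 = df.groupBy("name").agg(
  collect_list(struct(col("food"), col("price"))) as "food"
)
df3.show(false)

This yields the same nested output while staying inside built-in functions that Catalyst can optimize, so it typically performs better than a UDF-based zip.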