Spark expression: rename the column list after aggregation


Problem description

I have written the code below to group and aggregate the columns:

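 import org.apache.spark.sql.functions.col

 val gmList = List("gc1", "gc2", "gc3")
 val aList = List("val1", "val2", "val3", "val4", "val5")

 // Aggregate function applied to every column in aList
 val cype = "first"

 // Map of column name -> aggregate function name, e.g. "val1" -> "first"
 val exprs = aList.map(_ -> cype).toMap

 df.groupBy(gmList.map(col): _*).agg(exprs).show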

However, this produces columns whose names include the aggregate function, as shown below:

 +-----+-----+-----+-------------+-------------+-------------+-------------+-------------+
 | gc1 | gc2 | gc3 | first(val1) | first(val2) | first(val3) | first(val4) | first(val5) |
 +-----+-----+-----+-------------+-------------+-------------+-------------+-------------+

I want to alias each aggregated column back to its original name (first(val1) -> val1), and I want to do this generically as part of exprs.

Recommended answer

One approach is to alias the aggregated columns back to the original column names in a subsequent select. I would also suggest generalizing the single aggregate function (i.e. first) to a list of functions, as shown below:

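import org.apache.spark.sql.functions._
// Outside spark-shell, also import spark.implicits._ (from your SparkSession) so that toDF is available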

val df = Seq(
  (1, 10, "a1", "a2", "a3"),
  (1, 10, "b1", "b2", "b3"),
  (2, 20, "c1", "c2", "c3"),
  (2, 30, "d1", "d2", "d3"),
  (2, 30, "e1", "e2", "e3")
).toDF("gc1", "gc2", "val1", "val2", "val3")

val gmList = List("gc1", "gc2")
val aList = List("val1", "val2", "val3")

// Populate with different aggregate methods for individual columns if necessary
val fList = List.fill(aList.size)("first")

val afPairs = aList.zip(fList)
// afPairs: List[(String, String)] = List((val1,first), (val2,first), (val3,first))

df.
  groupBy(gmList.map(col): _*).agg(afPairs.toMap).
  select(gmList.map(col) ::: afPairs.map{ case (v, f) => col(s"$f($v)").as(v) }: _*).
  show
// +---+---+----+----+----+
// |gc1|gc2|val1|val2|val3|
// +---+---+----+----+----+
// |  2| 20|  c1|  c2|  c3|
// |  1| 10|  a1|  a2|  a3|
// |  2| 30|  d1|  d2|  d3|
// +---+---+----+----+----+
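
As a variation on the answer above, the alias can also be attached while building the aggregate expressions, so no follow-up select is needed. The sketch below is one possible way to do that rather than the original answer's method: it uses `expr` to parse strings such as `first(val1)` into columns and `Column.as` to rename them. The name `aggExprs` and the local SparkSession setup are assumptions added for illustration; the sketch also assumes that `aList` is non-empty and that the column names need no backtick quoting.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, expr}

// Assumption: a local session for illustration; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("alias-agg").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 10, "a1", "a2", "a3"),
  (1, 10, "b1", "b2", "b3"),
  (2, 20, "c1", "c2", "c3"),
  (2, 30, "d1", "d2", "d3"),
  (2, 30, "e1", "e2", "e3")
).toDF("gc1", "gc2", "val1", "val2", "val3")

val gmList = List("gc1", "gc2")
val aList = List("val1", "val2", "val3")
val fList = List.fill(aList.size)("first")

// One aliased Column per (column, function) pair, e.g. first(val1) AS val1
val aggExprs: List[Column] = aList.zip(fList).map { case (v, f) => expr(s"$f($v)").as(v) }

// agg expects a first Column followed by varargs, hence the head/tail split
val result = df.groupBy(gmList.map(col): _*).agg(aggExprs.head, aggExprs.tail: _*)

result.show()
```

Because the alias is part of each expression, the result already carries the column names gc1, gc2, val1, val2, val3. The select-based version in the answer remains useful if you want to keep passing a Map to agg.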
