Spark/Scala repeated calls to withColumn() using the same function on multiple columns


Problem description

I currently have code in which I repeatedly apply the same procedure to multiple DataFrame columns via chained .withColumn calls, and I want to create a function to streamline the procedure. In my case, I am computing cumulative sums over columns, aggregated by key:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// One .withColumn call per column, each repeating the same window specification
val newDF = oldDF
  .withColumn("cumA", sum("A").over(Window.partitionBy("ID").orderBy("time")))
  .withColumn("cumB", sum("B").over(Window.partitionBy("ID").orderBy("time")))
  .withColumn("cumC", sum("C").over(Window.partitionBy("ID").orderBy("time")))
  //.withColumn(...)

What I would like is something along the lines of:

def createCumulativeColumns(cols: Array[String], df: DataFrame): DataFrame = {
  // Implement the above cumulative sums, partitioning, and ordering
}

Or, better:

def withColumns(cols: Array[String], df: DataFrame, f: String => Column): DataFrame = {
  // Apply a UDF or arbitrary column expression to all the specified columns
}

Answer

You can use select with varargs, including *:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

df.select($"*" +: Seq("A", "B", "C").map(c =>
  sum(c).over(Window.partitionBy("ID").orderBy("time")).alias(s"cum$c")
): _*)

This:

  • Maps column names to window expressions with Seq("A", ...).map(...)
  • Prepends all pre-existing columns with $"*" +: ....
  • Unpacks the combined sequence with ... : _*.
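
To make this concrete, here is a minimal, self-contained sketch against a hypothetical toy dataset (the SparkSession setup and the sample rows are assumptions, not part of the original answer; column names match the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("cumsum-example").getOrCreate()
import spark.implicits._

// Hypothetical toy data: one row per (ID, time)
val df = Seq(
  (1, 1L, 10, 100, 1000),
  (1, 2L, 20, 200, 2000),
  (2, 1L, 30, 300, 3000)
).toDF("ID", "time", "A", "B", "C")

val w = Window.partitionBy("ID").orderBy("time")

// A single select builds all three running totals at once
val result = df.select($"*" +: Seq("A", "B", "C").map(c =>
  sum(c).over(w).alias(s"cum$c")
): _*)

result.show()  // cumA/cumB/cumC hold per-ID running totals ordered by time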

The pattern can be generalized to:

import org.apache.spark.sql.{Column, DataFrame}
import spark.implicits._

/**
 * @param cols a sequence of columns to transform
 * @param df an input DataFrame
 * @param f a function to be applied on each col in cols
 */
def withColumns(cols: Seq[String], df: DataFrame, f: String => Column): DataFrame =
  df.select($"*" +: cols.map(c => f(c)): _*)
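
For example (a usage sketch; the window spec and the cum-prefixed aliases are assumptions chosen to reproduce the original column names):

val w = Window.partitionBy("ID").orderBy("time")
val newDF = withColumns(Seq("A", "B", "C"), oldDF, c => sum(c).over(w).alias(s"cum$c"))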

If you find the withColumn syntax more readable, you can use foldLeft:

Seq("A", "B", "C").foldLeft(df)((df, c) =>
  df.withColumn(s"cum$c", sum(c).over(Window.partitionBy("ID").orderBy("time")))
)

This can in turn be generalized to, for example:

/**
 * @param cols a sequence of columns to transform
 * @param df an input DataFrame
 * @param f a function to be applied on each col in cols
 * @param name a function mapping an input name to an output name
 */
def withColumns(cols: Seq[String], df: DataFrame,
    f: String => Column, name: String => String = identity): DataFrame =
  cols.foldLeft(df)((df, c) => df.withColumn(name(c), f(c)))
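
Hypothetical usage, reproducing the same cumulative sums with output names derived via the name parameter (the window spec is again an assumption):

val w = Window.partitionBy("ID").orderBy("time")
val newDF = withColumns(Seq("A", "B", "C"), oldDF, c => sum(c).over(w), c => s"cum$c")

One design note: each withColumn call introduces an extra projection in the logical plan, so for a large number of columns the single-select variant tends to produce a leaner plan.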
