How to efficiently map over DF and use combination of outputs?

Question

Given a DF, let's say I have 3 classes each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations).

What is the best way to get a resulting df that will contain the original df A and the 3 added columns?

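import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// toDF needs spark.implicits._ (already in scope in spark-shell)

val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

// addCol in the first class
def addCol(df: DataFrame): DataFrame = {
    df.withColumn("method1", col("num1") / col("num2"))
}
// addCol in the second class
def addCol(df: DataFrame): DataFrame = {
    df.withColumn("method2", col("num1") * col("num2"))
}
// addCol in the third class
def addCol(df: DataFrame): DataFrame = {
    df.withColumn("method3", col("num1") + col("num2"))
}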

One option is actions.foldLeft(df) { (df, action) => action.addCol(df) }. The end result is the DF I want, with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. What is a more efficient way to do this?

Recommended answer

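The efficient way is to use select.

select will be faster than foldLeft here: every withColumn call adds another projection to the query plan, whereas a single select adds all the columns in one projection. You can build the required expressions and use them inside select, as in the code below.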

scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1   |2   |
|2   |5   |
|3   |7   |
+----+----+

scala> val colExpr = Seq(
    $"num1",
    $"num2",
    ($"num1" / $"num2").as("method1"),
    ($"num1" * $"num2").as("method2"),
    ($"num1" + $"num2").as("method3")
)

Final output

scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
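
If the original columns should be carried over without listing each of them, one possible variation is to prepend col("*") to the derived expressions. The following is a minimal sketch rather than part of the original answer. It assumes a local SparkSession, and the names newCols and result are illustrative only.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[*]").appName("select-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

// Only the derived columns are spelled out (newCols is an illustrative name).
val newCols: Seq[Column] = Seq(
  (col("num1") / col("num2")).as("method1"),
  (col("num1") * col("num2")).as("method2"),
  (col("num1") + col("num2")).as("method3")
)

// col("*") expands to every existing column of df, so num1 and num2 are kept.
val result = df.select(col("*") +: newCols: _*)
result.show(false)
```

Building the expression list does not run any Spark job. The computation only happens when an action such as show is called, and it is then executed across the partitions of df.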

Update

Return a Column instead of a DataFrame. Try using higher-order functions: all three of your functions can be replaced with the single function below.

scala> def add(
               num1: Column, // You could use variable arguments here if you want.
               num2: Column,
               f: (Column, Column) => Column
             ): Column = f(num1, num2)

For example, with varargs the required columns have to be passed at the end when invoking the method.

def add(f: (Column,Column) => Column,cols:Column*): Column = cols.reduce(f)
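
A call against the varargs signature could look like the sketch below. It is not taken from the original answer: the local SparkSession setup and the name viaVarargs are assumptions made for illustration.

```scala
import org.apache.spark.sql.{Column, SparkSession}

// Assumption: a local session for experimentation; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[*]").appName("varargs-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

// Same varargs signature as above: the combining function first, the columns last.
def add(f: (Column, Column) => Column, cols: Column*): Column = cols.reduce(f)

// viaVarargs is an illustrative name.
val viaVarargs = df.select(
  $"num1",
  $"num2",
  add(_ / _, $"num1", $"num2").as("method1"),
  add(_ * _, $"num1", $"num2").as("method2"),
  add(_ + _, $"num1", $"num2").as("method3")
)
viaVarargs.show(false)
```

Because reduce combines the columns from left to right, passing three columns a, b, c with _ + _ builds (a + b) + c. reduce throws on an empty sequence, so at least one column has to be supplied.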

Invoking the add function.

scala> val colExpr = Seq(
    $"num1",
    $"num2",
    add($"num1",$"num2",(_ / _)).as("method1"),
    add($"num1", $"num2",(_ * _)).as("method2"),
    add($"num1", $"num2",(_ + _)).as("method3")
)

Final output

scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
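
To connect this back to the three classes in the question, one possible arrangement is to have each class expose a Column and to collect them into a single select. This is only a sketch under stated assumptions: the trait ColumnAction, its member newCol, and the classes Method1, Method2, and Method3 are hypothetical names that do not appear in the question or the answer.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[*]").appName("actions-sketch").getOrCreate()
import spark.implicits._

// Hypothetical interface: each class describes its column instead of transforming the DataFrame.
trait ColumnAction { def newCol: Column }

class Method1 extends ColumnAction {
  def newCol: Column = (col("num1") / col("num2")).as("method1")
}
class Method2 extends ColumnAction {
  def newCol: Column = (col("num1") * col("num2")).as("method2")
}
class Method3 extends ColumnAction {
  def newCol: Column = (col("num1") + col("num2")).as("method3")
}

val actions: Seq[ColumnAction] = Seq(new Method1, new Method2, new Method3)
val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

// One select: col("*") keeps the existing columns, then one expression per action.
val combined = df.select(col("*") +: actions.map(_.newCol): _*)
combined.show(false)
```

With this shape, adding a fourth calculation means adding one more class to actions, and the select call itself stays unchanged.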
