Adding a new column to a Dataframe by using the values of multiple other columns in the dataframe - spark/scala


Question


I am new to spark SQL and Dataframes. I have a Dataframe to which I should be adding a new column based on the values of other columns. I have a Nested IF formula from excel that I should be implementing (for adding values to the new column), which when converted into programmatic terms, is something like this:

if(k =='yes')
{
  if(!(i==''))
  {
    if(diff(max_date, target_date) < 0)
    {
      if(j == '')
      {
        "pending" //the value of the column
      }
      else {
        "approved" //the value of the column
      }
    }
    else{
      "expired" //the value of the column
    }
  }
  else{
    "" //the value should be empty
  }
}
else{
  "" //the value should be empty
} 


i,j,k are three other columns in the Dataframe. I know we can use withColumn and when to add new columns based on other columns, but I am not sure how I can achieve the above logic using that approach.


What would be an easy/efficient way to implement the above logic for adding the new column? Any help would be appreciated.

Thanks.

Answer


First thing, let's simplify that if statement:

if(k == "yes" && i.nonEmpty)
  if(maxDate - targetDate < 0)
    if (j.isEmpty) "pending" 
    else "approved"
  else "expired"
else ""
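
As a quick sanity check, the simplified branch can be mirrored as a plain Scala function and exercised against every branch of the original Excel formula (the sample inputs below are hypothetical, chosen only to hit each case):

```scala
// Plain-Scala mirror of the simplified conditional above, handy for
// spot-checking each branch before wiring it into Spark.
// All sample inputs are hypothetical.
def state(i: String, j: String, k: String, dateDiff: Long): String =
  if (k == "yes" && i.nonEmpty)
    if (dateDiff < 0)
      if (j.isEmpty) "pending" else "approved"
    else "expired"
  else ""

println(state("x", "",   "yes", -1)) // pending
println(state("x", "ok", "yes", -1)) // approved
println(state("x", "",   "yes",  3)) // expired
println(state("",  "",   "yes", -1)) // empty string
println(state("x", "",   "no",  -1)) // empty string
```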


Now there are two main ways to accomplish this:

  1. Use a custom UDF
  2. Use Spark's built-in functions: coalesce, when and otherwise



Custom UDF

Now due to the complexity of your conditions, it will be rather tricky to do number 2. Using a custom UDF should suit your needs.

import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._ // for the $"colName" syntax

def getState(i: String, j: String, k: String, maxDate: Long, targetDate: Long): String =
  if (k == "yes" && i.nonEmpty)
    if (maxDate - targetDate < 0)
      if (j.isEmpty) "pending"
      else "approved"
    else "expired"
  else ""

val stateUdf = udf(getState _)
df.withColumn("state", stateUdf($"i", $"j", $"k", lit(0), lit(0)))


Just change lit(0) and lit(0) to your date code, and this should work for you.
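
If max_date and target_date are calendar dates, one way to produce that "date code" (a hedged sketch; this java.time helper is an illustration, not part of the original answer) is to compute the day difference before passing it to the UDF:

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical helper: day difference max_date - target_date,
// so a negative result means max_date falls before target_date,
// which is the "pending"/"approved" branch in getState.
def dayDiff(maxDate: LocalDate, targetDate: LocalDate): Long =
  ChronoUnit.DAYS.between(targetDate, maxDate)

println(dayDiff(LocalDate.of(2020, 1, 1), LocalDate.of(2020, 1, 5))) // -4
```

Alternatively, if both dates live in DataFrame columns, Spark's built-in datediff($"max_date", $"target_date") could stand in for the lit(0) placeholders directly, keeping the whole computation on the executors.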


If you notice performance issues, you can switch to using coalesce, otherwise, and when, which would look something like this:

val isApproved = df.withColumn("state",
  when($"k" === "yes" && $"i" =!= "" &&
       (lit(max_date) - lit(target_date) < 0) && $"j" =!= "", "approved")
    .otherwise(null))
val isPending = isApproved.withColumn("state",
  coalesce($"state",
    when($"k" === "yes" && $"i" =!= "" &&
         (lit(max_date) - lit(target_date) < 0) && $"j" === "", "pending")
      .otherwise(null)))
val isExpired = isPending.withColumn("state",
  coalesce($"state",
    when($"k" === "yes" && $"i" =!= "" &&
         (lit(max_date) - lit(target_date) >= 0), "expired")
      .otherwise(null)))
val finalDf = isExpired.withColumn("state", coalesce($"state", lit("")))


I've used custom UDFs in the past with large input sources without issues, and a custom UDF can lead to much more readable code, especially in this case.

