Adding a new column to a DataFrame by using the values of multiple other columns in the DataFrame - Spark/Scala


Question

I am new to Spark SQL and DataFrames. I have a DataFrame to which I need to add a new column based on the values of other columns. I have a nested IF formula from Excel that I should implement (for adding values to the new column), which, when converted into programmatic terms, looks something like this:

if (k == 'yes') {
  if (!(i == '')) {
    if (diff(max_date, target_date) < 0) {
      if (j == '') {
        "pending"   // the value of the column
      } else {
        "approved"  // the value of the column
      }
    } else {
      "expired"     // the value of the column
    }
  } else {
    ""              // the value should be empty
  }
} else {
  ""                // the value should be empty
}

i, j, and k are three other columns in the DataFrame. I know we can use withColumn and when to add new columns based on other columns, but I am not sure how to achieve the above logic using that approach.

What would be an easy/efficient way to implement the above logic for adding the new column? Any help would be appreciated.

Thanks.

Answer

First, let's simplify that if statement:

if(k == "yes" && i.nonEmpty)
  if(maxDate - targetDate < 0)
    if (j.isEmpty) "pending" 
    else "approved"
  else "expired"
else ""

Now there are two main ways to accomplish this:

  1. Using a custom UDF
  2. Using Spark built-in functions: coalesce, when, otherwise

Custom UDF

Now, due to the complexity of your conditions, option 2 will be rather tricky to do. Using a custom UDF should suit your needs.

import org.apache.spark.sql.functions.{udf, lit}
import spark.implicits._  // for the $"colName" syntax; assumes `spark` is your SparkSession

// Mirrors the simplified if statement above, evaluated per row
def getState(i: String, j: String, k: String, maxDate: Long, targetDate: Long): String =
  if (k == "yes" && i.nonEmpty)
    if (maxDate - targetDate < 0)
      if (j.isEmpty) "pending"
      else "approved"
    else "expired"
  else ""

val stateUdf = udf(getState _)
df.withColumn("state", stateUdf($"i", $"j", $"k", lit(0), lit(0)))

Just change the two lit(0) arguments to your date code, and this should work for you.
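
For instance, if the DataFrame itself carried the dates, the call could look like the sketch below. This assumes two date/timestamp columns named max_date and target_date (placeholder names, not from the question); unix_timestamp converts them to epoch seconds so they line up with the UDF's Long parameters.

import org.apache.spark.sql.functions.unix_timestamp

// Sketch only: max_date and target_date are assumed column names
val withState = df.withColumn("state",
  stateUdf($"i", $"j", $"k", unix_timestamp($"max_date"), unix_timestamp($"target_date")))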

If you notice performance issues, you can switch to using coalesce, when, and otherwise, which would look something like this:

// max_date / target_date are placeholders for your actual date values
val isApproved = df.withColumn("state",
  when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" =!= "", "approved").otherwise(null))
val isPending = isApproved.withColumn("state",
  coalesce($"state", when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) < 0) && $"j" === "", "pending").otherwise(null)))
val isExpired = isPending.withColumn("state",
  coalesce($"state", when($"k" === "yes" && $"i" =!= "" && (lit(max_date) - lit(target_date) >= 0), "expired").otherwise(null)))
val finalDf = isExpired.withColumn("state", coalesce($"state", lit("")))
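
For reference, the same logic can also be expressed as a single nested when/otherwise column, which stays closer to the original if structure. This is just a sketch; maxDateCol and targetDateCol below are placeholders (shown as lit(0), like in the answer above) that you would replace with your real date expressions.

import org.apache.spark.sql.functions.{when, lit}

// Placeholders for your real date expressions
val maxDateCol    = lit(0)
val targetDateCol = lit(0)

val stateDf = df.withColumn("state",
  when($"k" === "yes" && $"i" =!= "",
    when(maxDateCol - targetDateCol < 0,
      when($"j" === "", "pending").otherwise("approved")
    ).otherwise("expired")
  ).otherwise(""))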

I've used custom UDFs in the past with large input sources without issues, and custom UDFs can lead to much more readable code, especially in this case.
