Passing nullable columns as parameter to Spark SQL UDF


Question

Here is a Spark UDF that I'm using to compute a value using a few columns.

def spark_udf_func(s: String, i: Int): Boolean = {
    // I'm returning true regardless of the parameters passed to it.
    true
}

val spark_udf = org.apache.spark.sql.functions.udf(spark_udf_func _)

val df = sc.parallelize(Array[(Option[String], Option[Int])](
  (Some("Rafferty"), Some(31)), 
  (null, Some(33)), 
  (Some("Heisenberg"), Some(33)),  
  (Some("Williams"), null)
)).toDF("LastName", "DepartmentID")

df.withColumn("valid", spark_udf(df.col("LastName"), df.col("DepartmentID"))).show()

+----------+------------+-----+
|  LastName|DepartmentID|valid|
+----------+------------+-----+
|  Rafferty|          31| true|
|      null|          33| true|
|Heisenberg|          33| true|
|  Williams|        null| null|
+----------+------------+-----+

Can anyone explain why the value for column valid is null for the last row?

When I checked the Spark plan I was able to figure out that it has a case condition which says that if column 2 (DepartmentID) is null, the expression has to return null.

== Physical Plan ==

*Project [_1#699 AS LastName#702, _2#700 AS DepartmentID#703, if (isnull(_2#700)) null else UDF(_1#699, _2#700) AS valid#717]
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, unwrapoption(ObjectType(class java.lang.String), assertnotnull(input[0, scala.Tuple2, true])._1), true) AS _1#699, unwrapoption(IntegerType, assertnotnull(input[0, scala.Tuple2, true])._2) AS _2#700]
   +- Scan ExternalRDDScan[obj#698]

Why do we have such behaviour in Spark?
Why only Integer columns?
What is it that I'm doing wrong here, and what is the proper way to handle nulls within a UDF when the UDF parameter is null?

Answer

The issue is that null is not a valid value for a Scala Int (which is the backing type), while it is a valid value for a String. Int is equivalent to the Java int primitive and must always have a value. This means the UDF can't be called when the value is null, and therefore null remains.
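
To see why, here is a plain Scala sketch (independent of Spark) of what the compiler allows:

val s: String = null            // compiles: String is a reference type
// val i: Int = null            // does not compile: Int maps to the JVM int primitive
val j: java.lang.Integer = null // compiles: java.lang.Integer is the boxed reference type

Because the UDF's second parameter is declared as Int, Spark wraps the call in the null check you can see in the physical plan rather than ever invoking the function with a null.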

There are two ways to solve this:

  1. Change the function to accept java.lang.Integer (which is an object and can be null).
  2. If you can't change the function, use when/otherwise to do something special when the value is null, e.g. when(col("int col").isNull, someValue).otherwise(the original call); see the sketch below.
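
Here is a minimal sketch of both options, assuming the same df and spark_udf as above (the name spark_udf_boxed and the choice of false as the fallback value are illustrative, not part of the original answer):

import org.apache.spark.sql.functions.{col, lit, udf, when}

// Option 1: declare the parameter as java.lang.Integer, a reference type that can be null.
val spark_udf_boxed = udf((s: String, i: java.lang.Integer) => {
  // Handle the null case explicitly inside the function.
  i != null
})

df.withColumn("valid", spark_udf_boxed(col("LastName"), col("DepartmentID"))).show()

// Option 2: keep the original UDF and guard the null case with when/otherwise.
df.withColumn("valid",
  when(col("DepartmentID").isNull, lit(false)) // value to use when DepartmentID is null
    .otherwise(spark_udf(col("LastName"), col("DepartmentID")))
).show()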

A good explanation of this can be found here
