Passing nullable columns as parameter to Spark SQL UDF


Question

Here is a Spark UDF that I'm using to compute a value from a few columns.

def spark_udf_func(s: String, i: Int): Boolean = {
    // Returns true regardless of the parameters passed to it.
    true
}

val spark_udf = org.apache.spark.sql.functions.udf(spark_udf_func _)

val df = sc.parallelize(Array[(Option[String], Option[Int])](
  (Some("Rafferty"), Some(31)), 
  (null, Some(33)), 
  (Some("Heisenberg"), Some(33)),  
  (Some("Williams"), null)
)).toDF("LastName", "DepartmentID")

df.withColumn("valid", spark_udf(df.col("LastName"), df.col("DepartmentID"))).show()

+----------+------------+-----+
|  LastName|DepartmentID|valid|
+----------+------------+-----+
|  Rafferty|          31| true|
|      null|          33| true|
|Heisenberg|          33| true|
|  Williams|        null| null|
+----------+------------+-----+

Can anyone explain why the value of the valid column is null for the last row?

When I checked the Spark plan, I could see that it contains a case condition: if column 2 (DepartmentID) is null, it returns null.

== Physical Plan ==

*Project [_1#699 AS LastName#702, _2#700 AS DepartmentID#703, if (isnull(_2#700)) null else UDF(_1#699, _2#700) AS valid#717]
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, unwrapoption(ObjectType(class java.lang.String), assertnotnull(input[0, scala.Tuple2, true])._1), true) AS _1#699, unwrapoption(IntegerType, assertnotnull(input[0, scala.Tuple2, true])._2) AS _2#700]
   +- Scan ExternalRDDScan[obj#698]

Why does Spark behave this way?
Why does it happen only for the Integer column?
What am I doing wrong here, and what is the proper way to handle nulls within a UDF when a UDF parameter is null?

Answer

The issue is that null is not a valid value for a Scala Int (which is the backing type), while it is a valid value for a String. Int is equivalent to the Java int primitive and must always have a value. This means the UDF can't be called when the input value is null, so Spark leaves the result as null.
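
As a quick illustration (a sketch, independent of Spark): Scala's Int is backed by the JVM primitive int and cannot hold null, while the boxed java.lang.Integer is a reference type and can.

val boxed: java.lang.Integer = null   // compiles: Integer is a reference type
// val primitive: Int = null          // does not compile: Int is a value type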

There are two ways to solve this (both are sketched below):

  1. Change the function to accept java.lang.Integer (which is an object and can be null).
  2. If you can't change the function, use when/otherwise to handle the null case explicitly, e.g. when(col("int col").isNull, someValue).otherwise(the original call).
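
Here is a minimal sketch of both options against the example DataFrame above. The function and column names are taken from the question; the false fallback in option 2 is just an assumed placeholder, so pick whatever value makes sense for your data.

import org.apache.spark.sql.functions.{col, lit, udf, when}

// Option 1: declare the parameter as java.lang.Integer, which is a reference
// type and can therefore be null (unlike the primitive-backed Scala Int).
def spark_udf_func_boxed(s: String, i: java.lang.Integer): Boolean = {
    // Both s and i may now arrive as null; guard before using them.
    s != null && i != null
}
val spark_udf_boxed = udf(spark_udf_func_boxed _)
df.withColumn("valid", spark_udf_boxed(col("LastName"), col("DepartmentID"))).show()

// Option 2: keep the original UDF and short-circuit the null case yourself.
df.withColumn("valid",
    when(col("DepartmentID").isNull, lit(false))  // assumed placeholder for the null case
      .otherwise(spark_udf(col("LastName"), col("DepartmentID")))
).show()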

A good explanation of this can be found here.
