SparkSQL: How to deal with null values in user defined function?


Problem description

Given Table 1 with one column "x" of type String, I want to create Table 2 with a column "y" that is an integer representation of the date strings given in "x".

It is essential that null values are preserved in column "y".

Table 1 (Dataframe df1):

+----------+
|         x|
+----------+
|2015-09-12|
|2015-09-13|
|      null|
|      null|
+----------+
root
 |-- x: string (nullable = true)

Table 2 (Dataframe df2):

+----------+--------+                                                                  
|         x|       y|
+----------+--------+
|      null|    null|
|      null|    null|
|2015-09-12|20150912|
|2015-09-13|20150913|
+----------+--------+
root
 |-- x: string (nullable = true)
 |-- y: integer (nullable = true)

The user-defined function (UDF) that converts values from column "x" into those of column "y" is:

val extractDateAsInt = udf[Int, String] (
  // "2015-09-12" -> 20150912: strip the dashes and parse the digits as Int
  (d: String) => d.substring(0, 10)
      .filterNot("-".toSet)
      .toInt )

And while it works, dealing with null values is not possible.
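
For illustration, here is a minimal sketch of the failure mode (assuming a SparkSession named spark is in scope): as soon as an action touches a null row, d.substring(0, 10) dereferences a null String, and the job fails with a NullPointerException wrapped in a SparkException.

import spark.implicits._

// Rebuild Table 1 with two null entries in the nullable string column "x".
val df1 = Seq(Some("2015-09-12"), Some("2015-09-13"), None, None).toDF("x")

// Fails at evaluation time: the UDF body calls substring on a null String.
df1.withColumn("y", extractDateAsInt($"x")).show()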

Even though I can do something like

val extractDateAsIntWithNull = udf[Int, String] (
  (d: String) =>
    if (d != null) d.substring(0, 10).filterNot("-".toSet).toInt
    else 1 ) // dummy value; an Int-typed UDF cannot return null

I have found no way to "produce" null values via UDFs (of course, since an Int cannot be null).
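
The underlying restriction is plain Scala, independent of Spark: Int is a primitive value type, so null is not a legal value for it. A minimal illustration (the offending line is commented out because it does not compile):

// val n: Int = null
// error: an expression of type Null is ineligible for implicit conversion

// The boxed Java type, by contrast, can hold null:
val boxed: java.lang.Integer = null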

My current solution for the creation of df2 (Table 2) is as follows:

// holds data of table 1
val df1 = ...

// filter entries from df1 that are not null and compute "y" for them
val dfNotNulls = df1.filter(df1("x")
  .isNotNull)
  .withColumn("y", extractDateAsInt(df1("x")))
  .withColumnRenamed("x", "right_x")

// create df2 via a left outer join of df1 and dfNotNulls
val df2 = df1.join(dfNotNulls, df1("x") === dfNotNulls("right_x"), "leftouter").drop("right_x")

Questions:

  • The current solution seems cumbersome (and probably not efficient w.r.t. performance). Is there a better way?
  • @Spark-developers: Is there a type NullableInt planned / available, such that the following UDF is possible (see the code excerpt below)?

Code excerpt

val extractDateAsNullableInt = udf[NullableInt, String] (
  (d:String) => 
    if (d != null) d.substring(0, 10).filterNot( "-".toSet).toInt 
    else null )

Answer

This is where Option comes in handy:

val extractDateAsOptionInt = udf((d: String) => d match {
  case null => None // Spark renders None as null in the result column
  case s => Some(s.substring(0, 10).filterNot("-".toSet).toInt)
})
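
Applied to df1, None is rendered as null, and Spark infers a nullable integer column matching the schema of Table 2. A quick usage sketch, assuming the df1 from the question:

val df2 = df1.withColumn("y", extractDateAsOptionInt(df1("x")))

df2.printSchema()
// root
//  |-- x: string (nullable = true)
//  |-- y: integer (nullable = true)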

or, to make it slightly safer in the general case:

import scala.util.Try

val extractDateAsOptionInt = udf((d: String) => Try(
  d.substring(0, 10).filterNot("-".toSet).toInt
).toOption)
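
Note that the Try-based variant maps not only null but any malformed input to None, and hence to null in the result, which may or may not be what you want. A sketch of that behavior, assuming the definition above:

import spark.implicits._

Seq("2015-09-12", "not-a-date", null).toDF("x")
  .withColumn("y", extractDateAsOptionInt($"x"))
  .show()
// "2015-09-12" -> 20150912; "not-a-date" -> null; null -> null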

All credit goes to Dmitriy Selivanov, who pointed out this solution as a (missing?) edit here.

An alternative is to handle null outside the UDF:

import org.apache.spark.sql.functions.{lit, udf, when}
import org.apache.spark.sql.types.IntegerType
import spark.implicits._ // for the $"..." column syntax

val extractDateAsInt = udf(
   (d: String) => d.substring(0, 10).filterNot("-".toSet).toInt
)

df.withColumn("y",
  when($"x".isNull, lit(null))         // keep null rows null ...
    .otherwise(extractDateAsInt($"x")) // ... and only run the UDF otherwise
    .cast(IntegerType)
)
