SparkSQL: How to deal with null values in user defined function?
Question
Given Table 1 with one column "x" of type String, I want to create Table 2 with a column "y" that is an integer representation of the date strings given in "x". It is essential to keep the null values in column "y".
Table 1 (Dataframe df1):
+----------+
| x|
+----------+
|2015-09-12|
|2015-09-13|
| null|
| null|
+----------+
root
|-- x: string (nullable = true)
Table 2 (Dataframe df2):
+----------+--------+
| x| y|
+----------+--------+
| null| null|
| null| null|
|2015-09-12|20150912|
|2015-09-13|20150913|
+----------+--------+
root
|-- x: string (nullable = true)
|-- y: integer (nullable = true)
而将x"列的值转换为y"列的值的用户定义函数 (udf) 是:
While the user-defined function (udf) to convert values from column "x" into those of column "y" is:
val extractDateAsInt = udf[Int, String](
  (d: String) => d.substring(0, 10)
    .filterNot("-".toSet)
    .toInt
)
and it works, but dealing with null values is not possible.
Even though I can do something like
val extractDateAsIntWithNull = udf[Int, String](
  (d: String) =>
    if (d != null) d.substring(0, 10).filterNot("-".toSet).toInt
    else 1
)
I have found no way to "produce" null values via udfs (of course, since Ints cannot be null).
My current solution for the creation of df2 (Table 2) is as follows:
// holds data of table 1
val df1 = ...

// filter entries from df1 whose "x" is not null, compute "y",
// and rename "x" so the join below does not produce duplicate column names
val dfNotNulls = df1.filter(df1("x").isNotNull)
  .withColumn("y", extractDateAsInt(df1("x")))
  .withColumnRenamed("x", "right_x")

// create df2 via a left outer join of df1 and dfNotNulls, then drop the helper column
val df2 = df1.join(dfNotNulls, df1("x") === dfNotNulls("right_x"), "leftouter")
  .drop("right_x")
Questions:

- The current solution seems cumbersome (and probably not efficient wrt. performance). Is there a better way?
- @Spark-developers: Is there a type NullableInt planned / available, such that the following udf is possible (see code excerpt)?
Code excerpt
val extractDateAsNullableInt = udf[NullableInt, String](
  (d: String) =>
    if (d != null) d.substring(0, 10).filterNot("-".toSet).toInt
    else null
)
Answer
This is where Option comes in handy:
val extractDateAsOptionInt = udf((d: String) => d match {
case null => None
case s => Some(s.substring(0, 10).filterNot("-".toSet).toInt)
})
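The null-handling logic of this UDF body can be sanity-checked outside Spark as a plain Scala function; a minimal sketch (the name `parseDate` is just for illustration — Spark encodes a returned `None` as NULL in the resulting column):

```scala
// Hypothetical standalone version of the UDF body: wrapping the
// nullable input in Option maps null to None, so the caller never
// sees a sentinel value like the "else 1" hack above.
def parseDate(d: String): Option[Int] =
  Option(d).map(_.substring(0, 10).filterNot("-".toSet).toInt)
```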
or, to make it slightly safer in the general case:
import scala.util.Try
val extractDateAsOptionInt = udf((d: String) => Try(
d.substring(0, 10).filterNot("-".toSet).toInt
).toOption)
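Note that the Try variant is broader: it turns not only null input but any failing parse (e.g. a malformed or too-short string) into None. A small sketch of that behavior as a plain function, with an illustrative name:

```scala
import scala.util.Try

// Try catches the NullPointerException raised for null input as well
// as exceptions for malformed strings, converting both into None.
def parseDateSafe(d: String): Option[Int] =
  Try(d.substring(0, 10).filterNot("-".toSet).toInt).toOption
```

Whether silently dropping malformed dates is acceptable depends on the pipeline; if bad rows should fail loudly, the explicit null-match version above is preferable.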
All credit goes to Dmitriy Selivanov, who pointed out this solution as a (missing?) edit here.
An alternative is to handle null outside the UDF:
import org.apache.spark.sql.functions.{lit, udf, when}
import org.apache.spark.sql.types.IntegerType
import spark.implicits._ // for the $"colname" syntax

val extractDateAsInt = udf(
  (d: String) => d.substring(0, 10).filterNot("-".toSet).toInt
)

df.withColumn("y",
  when($"x".isNull, lit(null))
    .otherwise(extractDateAsInt($"x"))
    .cast(IntegerType)
)