Spark UDF not working with null values in Double field
Problem description
I'm trying to write a Spark UDF that replaces the null values of a Double field with 0.0. I'm using the Dataset API. Here's the UDF:
val coalesceToZero = udf((rate: Double) => if (Option(rate).isDefined) rate else 0.0)
This is based on the following function, which I tested and found to work fine:
scala> def cz(value: Double): Double = if (Option(value).isDefined) value else 0.0
cz: (value: Double)Double

scala> cz(null.asInstanceOf[Double])
res15: Double = 0.0
But when I use it in Spark in the following manner, the UDF doesn't work:
myDS.filter($"rate".isNull)
.select($"rate", coalesceToZero($"rate")).show
+----+---------+
|rate|UDF(rate)|
+----+---------+
|null| null|
|null| null|
|null| null|
|null| null|
|null| null|
|null| null|
+----+---------+
However, the following works:
val coalesceToZero = udf((rate: Any) => if (rate == null) 0.0 else rate.asInstanceOf[Double])
So I was wondering whether Spark has some special way of handling null Double values.
Recommended answer
scala.Double cannot be null, and the function you use seems to work only because:
scala> null.asInstanceOf[Double]
res2: Double = 0.0
(You can find a good answer describing this behavior in the question If an Int can't be null, what does null.asInstanceOf[Int] mean?)
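In other words, with a scala.Double parameter Spark treats the UDF input as non-nullable: for a null input it returns null without ever invoking the closure, which is exactly why the output column above shows null rather than 0.0. A boxed parameter type avoids that short-circuit, because the null actually reaches the function body. A minimal sketch of that variant (the java.lang.Double parameter is my illustration, not part of the original answer):

import org.apache.spark.sql.functions.udf

// Boxed java.lang.Double is nullable, so the null reaches the function
// body instead of being short-circuited by Spark (assumed variant).
val coalesceToZeroBoxed = udf((rate: java.lang.Double) =>
  if (rate == null) 0.0 else rate.doubleValue)

This is also why the asker's Any-based version works: the input arrives boxed, so the null check runs inside the function.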
If myDS is a statically typed Dataset, the right way is to use either Option[Double]:
case class MyCaseClass(rate: Option[Double])
or java.lang.Double:
case class MyCaseClass(rate: java.lang.Double)
Either of these would allow you to handle nulls with the statically typed API (not SQL / DataFrame), with the latter representation being favorable from a performance perspective.
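As a rough illustration of the Option[Double] representation (the sample data and the getOrElse step are mine, not the original answer's):

import spark.implicits._  // assumes a SparkSession named `spark`

case class MyCaseClass(rate: Option[Double])

val ds = Seq(MyCaseClass(Some(3.5)), MyCaseClass(None)).toDS()

// null rates surface as None, so ordinary Scala code handles them
ds.map(_.rate.getOrElse(0.0)).show()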
In general, I'd recommend filling NULLs using the SQL API:
import org.apache.spark.sql.functions.{coalesce, lit}
myDS.withColumn("rate", coalesce($"rate", lit(0.0)))
or DataFrameNaFunctions.fill:
df.na.fill(0.0, Seq("rate"))
before converting Dataset[Row] to Dataset[MyCaseClass].
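Putting the pieces together, a sketch of that recommended order (df and the rate column come from the snippets above; the non-optional case class is my illustration):

import spark.implicits._  // assumes a SparkSession named `spark`

// Fill the NULLs while still on the untyped DataFrame side...
val filled = df.na.fill(0.0, Seq("rate"))

// ...so the typed representation no longer needs Option or a boxed type.
case class MyCaseClass(rate: Double)
val ds = filled.as[MyCaseClass]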