Spark UDF not working with null values in Double field
Problem description
I'm trying to write a Spark UDF that replaces the null values of a Double field with 0.0. I'm using the Dataset API. Here's the UDF:
val coalesceToZero = udf((rate: Double) => if (Option(rate).isDefined) rate else 0.0)
This is based on the following function, which I tested and found to work fine:
scala> def cz(value: Double): Double = if (Option(value).isDefined) value else 0.0
cz: (value: Double)Double

scala> cz(null.asInstanceOf[Double])
res15: Double = 0.0
But when I use it in Spark in the following manner, the UDF doesn't work:
myDS.filter($"rate".isNull)
.select($"rate", coalesceToZero($"rate")).show
+----+---------+
|rate|UDF(rate)|
+----+---------+
|null| null|
|null| null|
|null| null|
|null| null|
|null| null|
|null| null|
+----+---------+
However, the following works:
val coalesceToZero = udf((rate: Any) => if (rate == null) 0.0 else rate.asInstanceOf[Double])
So I was wondering whether Spark has some special way of handling null Double values.
Recommended answer
scala.Double cannot be null, and the function you use seems to work only because:
scala> null.asInstanceOf[Double]
res2: Double = 0.0
(You can find a good answer describing this behavior in the question If an Int can't be null, what does null.asInstanceOf[Int] mean?)
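In other words, with a scala.Double parameter Spark treats the UDF input as non-nullable: for a null input it returns null without ever invoking the closure, which is exactly why the output column above shows null rather than 0.0. A boxed parameter type avoids that short-circuit, because the null actually reaches the function body. A minimal sketch of that variant (the java.lang.Double parameter is my illustration, not part of the original answer):

import org.apache.spark.sql.functions.udf

// Boxed java.lang.Double is nullable, so the null reaches the function
// body instead of being short-circuited by Spark (assumed variant).
val coalesceToZeroBoxed = udf((rate: java.lang.Double) =>
  if (rate == null) 0.0 else rate.doubleValue)

This is also why the asker's Any-based version works: the input arrives boxed, so the null check runs inside the function.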
If myDS is a statically typed Dataset, the right way is to use either Option[Double]:
case class MyCaseClass(rate: Option[Double])
or java.lang.Double:
case class MyCaseClass(rate: java.lang.Double)
Either of these would allow you to handle nulls with the statically typed API (not SQL / DataFrame), with the latter representation being favorable from a performance perspective.
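As a rough illustration of the Option[Double] representation (the sample data and the getOrElse step are mine, not the original answer's):

import spark.implicits._  // assumes a SparkSession named `spark`

case class MyCaseClass(rate: Option[Double])

val ds = Seq(MyCaseClass(Some(3.5)), MyCaseClass(None)).toDS()

// null rates surface as None, so ordinary Scala code handles them
ds.map(_.rate.getOrElse(0.0)).show()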
In general, I'd recommend filling NULLs using the SQL API:
import org.apache.spark.sql.functions.{coalesce, lit}
myDS.withColumn("rate", coalesce($"rate", lit(0.0)))
or DataFrameNaFunctions.fill:
df.na.fill(0.0, Seq("rate"))
before converting Dataset[Row] to Dataset[MyCaseClass].
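Putting the pieces together, a sketch of that recommended order (df and the rate column come from the snippets above; the non-optional case class is my illustration):

import spark.implicits._  // assumes a SparkSession named `spark`

// Fill the NULLs while still on the untyped DataFrame side...
val filled = df.na.fill(0.0, Seq("rate"))

// ...so the typed representation no longer needs Option or a boxed type.
case class MyCaseClass(rate: Double)
val ds = filled.as[MyCaseClass]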