How to filter nullable Array-Elements in Spark 1.6 UDF


Question

Consider the following DataFrame:

root
 |-- values: array (nullable = true)
 |    |-- element: double (containsNull = true)

with the content:

+-----------+
|     values|
+-----------+
|[1.0, null]|
+-----------+

Now I want to pass this values column to a UDF:

import org.apache.spark.sql.functions.udf

val inspect = udf((data: Seq[Double]) => {
  data.foreach(println)           // println takes Any, so the element is printed as-is
  println()
  data.foreach(d => println(d))   // d is typed as Double, forcing a cast first
  println()
  data.foreach(d => println(d == null))
  ""
})

df.withColumn("dummy", inspect($"values")).show()  // an action is needed to actually run the UDF

I'm really confused by the output of the above println statements:

1.0
null

1.0
0.0

false
false

My questions:

  1. Why is foreach(println) not giving the same output as foreach(d => println(d))?
  2. How can the Double be null in the first println statement? I thought Scala's Double cannot be null.
  3. How can I filter null values in my Seq other than filtering 0.0, which isn't really safe? Should I use Seq[java.lang.Double] as the type for my input in the UDF and then filter nulls? (This works, but I'm unsure if that is the way to go.)

Note that I'm aware of this question, but my question is specific to array-type columns.

Answer

Why is foreach(println) not giving the same output as foreach(d => println(d))?

In a context where Any is expected, the cast is skipped completely. This is explained in detail in If an Int can't be null, what does null.asInstanceOf[Int] mean?

How can the Double be null in the first println statement? I thought Scala's Double cannot be null.

The internal binary representation doesn't use Scala types at all. Once array data is decoded, it is represented as an Array[Any], and elements are coerced to the declared type with a simple asInstanceOf.
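
Taken together, these two facts explain the confusing output. Below is a minimal sketch of the behavior in plain Scala, outside Spark, assuming nothing beyond the standard library: printing a null through an Any-typed parameter skips the cast, while forcing it into a primitive Double unboxes it to 0.0.

object NullCastDemo extends App {
  // Simulate a decoded array element: the runtime value is null, typed as Any.
  val element: Any = null

  // println accepts Any, so no conversion to Double happens -- prints "null".
  println(element)

  // Casting null to a primitive Double unboxes it to the default -- prints "0.0".
  val d = element.asInstanceOf[Double]
  println(d)

  // The comparison runs on the already-unboxed primitive -- prints "false".
  println(d == null)
}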

Should I use Seq[java.lang.Double] as the type for my input in the UDF and then filter nulls?

In general, if the values are nullable, you should use an external type that is nullable as well, or an Option. Unfortunately, only the first option is applicable for UDFs.
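
A minimal sketch of that approach (the UDF name filterNulls and the output column name cleaned are illustrative, not from the original question): declaring the elements as the boxed java.lang.Double lets the nulls arrive intact, so they can be filtered explicitly before unboxing.

import org.apache.spark.sql.functions.udf

// Declare the elements as the nullable boxed type so nulls survive
// deserialization, drop them, then unbox back to primitive doubles.
val filterNulls = udf((data: Seq[java.lang.Double]) =>
  data.filter(_ != null).map(_.doubleValue)
)

df.withColumn("cleaned", filterNulls($"values")).show()

For the example row [1.0, null], the cleaned column would then contain [1.0].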
