Spark dataset reduce with null values?


Question

I'm creating a data frame with this code:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

  val data = List(
    List(444.1235D),
    List(67.5335D),
    List(69.5335D),
    List(677.5335D),
    List(47.5335D),
    List(null)
  )

  val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_))
  val schema = StructType(Array(
    StructField("value", DataTypes.DoubleType, true)
  ))

  val df = sqlContext.createDataFrame(rdd, schema)

Then I apply my udf to it:

val multip: Dataset[Double] = df.select(doubleUdf(df("value"))).as[Double]

and then I'm trying to use reduce on this dataset:

val multipl = multip.reduce(_ * _)

And here I get 0.0 as the result. I also tried filtering the nulls out:

val multipl = multip.filter(_ != null).reduce(_ * _)

with the same result. If I remove the null value from the data, everything works as it should. How can I make reduce work with null values?

My udf is defined like this:

import scala.util.Try
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

val doubleUdf: UserDefinedFunction = udf((v: Any) => Try(v.toString.toDouble).toOption)
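The udf body can be exercised in plain Scala, outside Spark, to see what it returns for each kind of input (a standalone sketch; the `parse` name is mine, not part of the question's code):

```scala
import scala.util.Try

// Standalone re-creation of the udf body (the name `parse` is hypothetical).
val parse: Any => Option[Double] = v => Try(v.toString.toDouble).toOption

parse(444.1235D) // Some(444.1235)
parse("oops")    // None: toDouble throws NumberFormatException, caught by Try
parse(null)      // None: null.toString throws NullPointerException, caught by Try
```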

Answer

I'll answer with a strong assumption that your doubleUdf function converts values to doubles, and rather than using an Option wrapper for nulls you are turning nulls into 0.0. So, if you want to keep the logic to drop nulls, then filter BEFORE anything else:

df.na.drop.select(doubleUdf(df("value"))).as[Double]
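To see why the product comes out as 0.0 rather than failing, note that `Dataset[Double]` is backed by the primitive `Double`, which has no null representation: a null decodes to the primitive default 0.0, and a single 0.0 factor zeroes the whole product. A minimal plain-Scala sketch of that arithmetic (no Spark involved; the values mirror the question's data):

```scala
// null coerced into a primitive Double becomes 0.0 -- the same thing the
// Dataset[Double] decoder does to the null row, which is also why
// `filter(_ != null)` can never catch it.
val decoded: Double = null.asInstanceOf[Double]

val values = List(444.1235, 67.5335, 69.5335, 677.5335, 47.5335, decoded)
val product = values.reduce(_ * _) // 0.0: one zero factor kills the product
```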

