Remove Null from Array Columns in Dataframe in Scala with Spark (1.6)

Question

I have a dataframe with a key column and a column that holds an array of structs. The schema looks like this:

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: long (nullable = false)

The array "desc" can contain any number of null values. Using Spark 1.6, I would like to create a final dataframe in which the array contains no null values.

An example:

Key  .   Value
1010 .   [[George,21],null,[MARIE,13],null]
1023 .   [null,[Watson,11],[John,35],null,[Kyle,33]]

I want the final dataframe to be:

Key  .   Value
1010 .   [[George,21],[MARIE,13]]
1023 .   [[Watson,11],[John,35],[Kyle,33]]

I tried doing this with a UDF and a case class but got:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to....

Any help is greatly appreciated. I would prefer to do this without converting to RDDs, if possible. I am also new to Spark and Scala, so thanks in advance!

Recommended answer

Given that the original dataframe has the following schema:

root
 |-- id: string (nullable = true)
 |-- desc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- age: long (nullable = false)

Defining a udf function that removes the nulls from the array should work for you:

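import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Drop the null entries, then rebuild each remaining Row as an `element` instance
def removeNull = udf((array: Seq[Row]) =>
  array.filterNot(_ == null).map(x => element(x.getAs[String]("name"), x.getAs[Long]("age"))))

df.withColumn("desc", removeNull(col("desc")))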

where element is a case class:

case class element(name: String, age: Long)
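
The udf above takes Seq[Row] rather than Seq[element] because Spark passes struct values into a Scala udf as Row objects. This is most likely the cause of the ClassCastException in the question: a parameter declared as a sequence of the case class is accepted, since the element type is erased at runtime, and the failure only appears when an element is actually used. The following Spark-free sketch reproduces that effect. FakeRow and Person are hypothetical stand-ins for Spark's Row and for the element case class; they are not part of the original answer.

```scala
import scala.util.Try

// Hypothetical stand-ins: FakeRow plays the role of Spark's Row,
// Person plays the role of the `element` case class.
final case class FakeRow(values: Seq[Any])
final case class Person(name: String, age: Long)

object ErasureDemo {
  // What the runtime actually hands over: rows (and nulls), not Person instances.
  val fromRuntime: Seq[Any] = Seq(FakeRow(Seq("George", 21L)), null)

  // The cast itself succeeds because the element type is erased at runtime.
  val wronglyTyped: Seq[Person] = fromRuntime.asInstanceOf[Seq[Person]]

  // The failure only shows up once an element is used as a Person.
  val attempt: Try[String] = Try(wronglyTyped.head.name)

  def main(args: Array[String]): Unit =
    println(attempt) // a Failure wrapping a ClassCastException
}
```

Reading the fields out of each Row with getAs, as the udf does, avoids the cast entirely.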

With the udf applied as above, you should get:

+----+-----------------------------------+
|id  |desc                               |
+----+-----------------------------------+
|1010|[[George,21], [MARIE,13]]          |
|1023|[[Watson,11], [John,35], [Kyle,33]]|
+----+-----------------------------------+
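
One caveat: the schema above marks desc itself as nullable, and calling filterNot on a null array would throw a NullPointerException inside the udf. A minimal sketch of a null-safe version of the filtering step, in plain Scala so it can be checked without a Spark session, is given below. NullSafe and dropNulls are hypothetical names, and returning an empty sequence for a null array is an assumption about the desired behaviour.

```scala
// Hypothetical helper: treats a null sequence as empty and drops null entries.
object NullSafe {
  def dropNulls[A <: AnyRef](xs: Seq[A]): Seq[A] =
    Option(xs)                 // None when the whole array is null
      .getOrElse(Seq.empty[A]) // assumption: a null array becomes an empty one
      .filter(_ != null)       // remove the null entries
}

object NullSafeDemo {
  def main(args: Array[String]): Unit = {
    println(NullSafe.dropNulls(Seq("George", null, "MARIE", null)))
    println(NullSafe.dropNulls[String](null))
  }
}
```

Inside the udf, the same idea would replace array.filterNot(_ == null) with a call such as NullSafe.dropNulls(array), followed by the existing map step.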
