Spark: Replace Null value in a Nested column
Question
I would like to replace all the n/a values in the dataframe below with unknown. The value can live in either a scalar or a complex nested column. If it were a flat StructField column, I could loop through the columns and replace n/a using withColumn. But I would like this to be done generically, regardless of the column's type, because I don't want to specify the column names explicitly; there are hundreds of them in my case.
case class Bar(x: Int, y: String, z: String)
case class Foo(id: Int, name: String, status: String, bar: Seq[Bar])
val df = spark.sparkContext.parallelize(
  Seq(
    Foo(123, "Amy", "Active", Seq(Bar(1, "first", "n/a"))),
    Foo(234, "Rick", "n/a", Seq(Bar(2, "second", "fifth"), Bar(22, "second", "n/a"))),
    Foo(567, "Tom", "null", Seq(Bar(3, "second", "sixth")))
  )).toDF
df.printSchema
df.show(20, false)
Result:
+---+----+------+---------------------------------------+
|id |name|status|bar |
+---+----+------+---------------------------------------+
|123|Amy |Active|[[1, first, n/a]] |
|234|Rick|n/a |[[2, second, fifth], [22, second, n/a]]|
|567|Tom |null |[[3, second, sixth]] |
+---+----+------+---------------------------------------+
Expected output:
+---+----+----------+---------------------------------------------------+
|id |name|status |bar |
+---+----+----------+---------------------------------------------------+
|123|Amy |Active |[[1, first, unknown]] |
|234|Rick|unknown |[[2, second, fifth], [22, second, unknown]] |
|567|Tom |null |[[3, second, sixth]] |
+---+----+----------+---------------------------------------------------+
Any suggestions on this?
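For the flat case described above, the withColumn loop might look like the following sketch. The name fixFlat is illustrative, not from the original; it only touches top-level StringType columns and leaves nested fields such as bar untouched, which is exactly the limitation the question is trying to get past:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

// Sketch of the flat-column approach: loop over top-level string columns
// and rewrite "n/a" to "unknown". Nested columns (arrays, structs) are not visited.
def fixFlat(df: DataFrame): DataFrame =
  df.schema.fields
    .filter(_.dataType == StringType)
    .foldLeft(df) { (acc, f) =>
      acc.withColumn(f.name, when(col(f.name) === "n/a", "unknown").otherwise(col(f.name)))
    }
```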
Answer
If you like playing with RDDs, here's a simple, generic and extensible solution:
val naToUnknown = { r: Row =>
  def rec(v: Any): Any = v match {
    case row: Row                => Row.fromSeq(row.toSeq.map(rec)) // recurse into nested structs
    case seq: Seq[_]             => seq.map(rec)                    // recurse into array elements
    case s: String if s == "n/a" => "unknown"                       // replace the sentinel value
    case other                   => other                           // leave everything else untouched
  }
  Row.fromSeq(r.toSeq.map(rec))
}
val newDF = spark.createDataFrame(df.rdd.map{naToUnknown}, df.schema)
newDF.show(false)
Output:
+---+----+-------+-------------------------------------------+
|id |name|status |bar |
+---+----+-------+-------------------------------------------+
|123|Amy |Active |[[1, first, unknown]] |
|234|Rick|unknown|[[2, second, fifth], [22, second, unknown]]|
|567|Tom |null |[[3, second, sixth]] |
+---+----+-------+-------------------------------------------+
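The recursive core of naToUnknown is ordinary Scala pattern matching, so it can be sanity-checked on plain values without a SparkSession. A minimal standalone sketch (the Row case is dropped here, since Row only appears inside Spark):

```scala
// Standalone version of the recursion used in naToUnknown:
// descend into sequences, rewrite the sentinel string, pass everything else through.
def rec(v: Any): Any = v match {
  case seq: Seq[_]             => seq.map(rec)
  case s: String if s == "n/a" => "unknown"
  case other                   => other
}

val sample = Seq("first", "n/a", Seq("fifth", "n/a"), 42)
val fixed  = rec(sample)
// fixed == Seq("first", "unknown", Seq("fifth", "unknown"), 42)
```

Because rec returns its input unchanged for anything that is not a sequence or the sentinel string, ints, nulls and other types pass through safely, which is why the full version preserves the schema when rebuilding the DataFrame with spark.createDataFrame(..., df.schema).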