Create empty array-column of given schema in Spark
Question
Because Parquet cannot persist empty arrays, I replaced empty arrays with null before writing a table. Now, when I read the table back, I want to do the opposite:
I have a DataFrame with the following schema:
|-- id: long (nullable = false)
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
and the following content:
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| null|
+---+-----------+
I'd like to replace the null array (id=2) with an empty array, i.e.
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
I tried:
val arrSchema = df.schema(1).dataType

df
  .withColumn("arr", when($"arr".isNull, array().cast(arrSchema)).otherwise($"arr"))
  .show()
which gives:
java.lang.ClassCastException: org.apache.spark.sql.types.NullType$ cannot be cast to org.apache.spark.sql.types.StructType
Edit: I don't want to "hardcode" any schema of my array column (at least not the schema of the struct) because this can vary from case to case. I can only use the schema information from df at runtime.
I'm using Spark 2.1 by the way, therefore I cannot use typedLit.
Answer
One way is to use a UDF:
// arrSchema is read from the DataFrame at runtime, so nothing is hardcoded:
// ArrayType(StructType(StructField(x,DoubleType,true), StructField(y,DoubleType,true)), true)
val arrSchema = df.schema(1).dataType

// A zero-argument UDF whose declared return type is the array column's own schema
val emptyArr = udf(() => Seq.empty[Any], arrSchema)

df
  .withColumn("arr", when($"arr".isNull, emptyArr()).otherwise($"arr"))
  .show()
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
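As an aside, the `when`/`otherwise` pair can be written more compactly with `coalesce`, which returns the first non-null value among its arguments. A minimal sketch, assuming the same `df` and the `emptyArr` UDF defined above (and a running SparkSession in scope):

```scala
import org.apache.spark.sql.functions.coalesce

// coalesce($"arr", emptyArr()) yields $"arr" when it is non-null,
// and the empty array produced by the UDF otherwise
df
  .withColumn("arr", coalesce($"arr", emptyArr()))
  .show()
```

This is behaviorally equivalent to the `when`/`otherwise` version; it works on Spark 2.1 because `coalesce` over columns has been available since Spark 1.3.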