Spark union fails with nested JSON dataframe
Problem description
I have the following two JSON files:
{
  "name" : "Agent1",
  "age" : "32",
  "details" : [{
      "d1" : 1,
      "d2" : 2
    }
  ]
}
{
  "name" : "Agent2",
  "age" : "42",
  "details" : []
}
I read them with Spark:
val jsonDf1 = spark.read.json(pathToJson1)
val jsonDf2 = spark.read.json(pathToJson2)
Two dataframes are created, with the following schemas:
root
|-- age: string (nullable = true)
|-- details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- d1: long (nullable = true)
| | |-- d2: long (nullable = true)
|-- name: string (nullable = true)
root
|-- age: string (nullable = true)
|-- details: array (nullable = true)
| |-- element: string (containsNull = true)
|-- name: string (nullable = true)
When I try to perform a union of these two dataframes, I get this error:
jsonDf1.union(jsonDf2)
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;;
'Union
:- LogicalRDD [age#0, details#1, name#2]
+- LogicalRDD [age#7, details#8, name#9]
How can I resolve this? The JSON files the Spark job loads will sometimes contain empty arrays, but it still has to union them. That shouldn't be a problem, since the schema of the JSON files is the same.
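(For context: the schemas are not actually identical after inference, because Spark infers `array<string>` for the empty `details` array in the second file instead of the array-of-struct type. One way to sketch a workaround, not part of the original question, is to reuse the first file's inferred schema when reading the second, so both sides of the union get the same types:)

```scala
// Sketch, reusing jsonDf1 and pathToJson2 from above.
// Passing an explicit schema prevents Spark from inferring
// array<string> for the empty "details" array in the second file.
val jsonDf2Typed = spark.read.schema(jsonDf1.schema).json(pathToJson2)

// The schemas now match, so the union resolves.
val unioned = jsonDf1.union(jsonDf2Typed)
```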
Answer
polomarcus's answer led me to this solution: I couldn't read all the files at once because I receive a list of files as input, and Spark's `json` reader takes varargs paths rather than a list, but Scala's `: _*` expansion lets me pass the list anyway:
val files = List("path1", "path2", "path3")
val dataframe = spark.read.json(files: _*)
This way I got one dataframe containing all three files.
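(Reading all the paths in one call also lets Spark infer a single schema across every file, so an empty array in one file no longer produces a conflicting type. If inference is still a concern, a hypothetical explicit schema can be supplied up front; `agentSchema` below is an assumption matching the JSON shown above, not code from the original answer:)

```scala
import org.apache.spark.sql.types._

// Hypothetical explicit schema for the agent JSON files above.
// Supplying it skips inference, so empty "details" arrays keep
// the array-of-struct type instead of becoming array<string>.
val agentSchema = new StructType()
  .add("age", StringType)
  .add("details", ArrayType(new StructType()
    .add("d1", LongType)
    .add("d2", LongType)))
  .add("name", StringType)

val files = List("path1", "path2", "path3")
val dataframe = spark.read.schema(agentSchema).json(files: _*)
```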