Spark union fails with nested JSON dataframe
Question
I have the following two JSON files:
{
  "name" : "Agent1",
  "age" : "32",
  "details" : [{
    "d1" : 1,
    "d2" : 2
  }]
}

{
  "name" : "Agent2",
  "age" : "42",
  "details" : []
}
I read them in Spark:
val jsonDf1 = spark.read.json(pathToJson1)
val jsonDf2 = spark.read.json(pathToJson2)
Two dataframes are created with the following schemas:
root
|-- age: string (nullable = true)
|-- details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- d1: long (nullable = true)
| | |-- d2: long (nullable = true)
|-- name: string (nullable = true)

root
|-- age: string (nullable = true)
|-- details: array (nullable = true)
| |-- element: string (containsNull = true)
|-- name: string (nullable = true)
When I try to perform a union of these two dataframes I get this error:
jsonDf1.union(jsonDf2)
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;;
'Union
:- LogicalRDD [age#0, details#1, name#2]
+- LogicalRDD [age#7, details#8, name#9]
How can I resolve this? The JSON files the Spark job loads will sometimes contain empty arrays, but it still has to union them, which shouldn't be a problem since the schema of the JSON files is the same.
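The root cause is that schema inference types the empty `details` array in the second file as `array<string>`, which no longer matches the `array<struct>` of the first. A minimal sketch of one workaround (not from the original post; it assumes the first file contains the full schema) is to reuse the inferred schema of the first dataframe when reading the second:

// Sketch: read the first file normally, then force its schema
// onto the second read so the empty `details` array is still
// typed as array<struct> instead of array<string>.
val jsonDf1 = spark.read.json(pathToJson1)
val jsonDf2 = spark.read.schema(jsonDf1.schema).json(pathToJson2)

// With identical schemas, the union now resolves.
val unioned = jsonDf1.union(jsonDf2)

This relies on the first file actually exercising every field; if the "complete" file can vary, an explicitly constructed schema is safer.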
Answer
polomarcus's answer led me to this solution: I couldn't read all the files at once because I receive a list of files as input, and I thought Spark had no API that accepts a list of paths, but apparently in Scala it is possible:
val files = List("path1", "path2", "path3")
val dataframe = spark.read.json(files: _*)
This way I got one dataframe containing all three files.
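Reading all the paths in one call works here because inference then sees the non-empty arrays and picks the struct element type for every row. If none of the files in a given batch is guaranteed to contain a non-empty `details` array, a hedged variant (the paths are placeholders and the schema below is my own reconstruction from the files shown above, not something stated in the answer) is to combine the varargs expansion with an explicit schema, which sidesteps inference entirely:

import org.apache.spark.sql.types._

// Hypothetical explicit schema matching the JSON files above, so an
// empty `details` array is still typed as array<struct<d1,d2>>.
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", StringType),
  StructField("details", ArrayType(StructType(Seq(
    StructField("d1", LongType),
    StructField("d2", LongType)
  ))))
))

val files = List("path1", "path2", "path3")
val dataframe = spark.read.schema(schema).json(files: _*)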