Spark union fails with nested JSON dataframe
Problem description
I have the following two JSON files:
{
  "name" : "Agent1",
  "age" : "32",
  "details" : [{
      "d1" : 1,
      "d2" : 2
    }
  ]
}
{
  "name" : "Agent2",
  "age" : "42",
  "details" : []
}
I read them with Spark:
val jsonDf1 = spark.read.json(pathToJson1)
val jsonDf2 = spark.read.json(pathToJson2)
Two dataframes are created, with the following schemas:
root
|-- age: string (nullable = true)
|-- details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- d1: long (nullable = true)
| | |-- d2: long (nullable = true)
|-- name: string (nullable = true)
root
|-- age: string (nullable = true)
|-- details: array (nullable = true)
| |-- element: string (containsNull = true)
|-- name: string (nullable = true)
When I try to perform a union of these two dataframes, I get this error:
jsonDf1.union(jsonDf2)
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;;
'Union
:- LogicalRDD [age#0, details#1, name#2]
+- LogicalRDD [age#7, details#8, name#9]
How can I resolve this? The JSON files the Spark job loads will sometimes contain empty arrays, but it still has to union them. That shouldn't be a problem, since the schema of the JSON files is the same.
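(For context: the schemas are not actually identical after inference, because Spark infers `array<string>` for the empty `details` array in the second file instead of the array-of-struct type. One way to sketch a workaround, not part of the original question, is to reuse the first file's inferred schema when reading the second, so both sides of the union get the same types:)

```scala
// Sketch, reusing jsonDf1 and pathToJson2 from above.
// Passing an explicit schema prevents Spark from inferring
// array<string> for the empty "details" array in the second file.
val jsonDf2Typed = spark.read.schema(jsonDf1.schema).json(pathToJson2)

// The schemas now match, so the union resolves.
val unioned = jsonDf1.union(jsonDf2Typed)
```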
Answer
polomarcus's answer led me to this solution: I couldn't read all the files at once because I receive a list of files as input, and Spark's `json` reader takes varargs paths rather than a list, but Scala's `: _*` expansion lets me pass the list anyway:
val files = List("path1", "path2", "path3")
val dataframe = spark.read.json(files: _*)
This way I got one dataframe containing all three files.
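(Reading all the paths in one call also lets Spark infer a single schema across every file, so an empty array in one file no longer produces a conflicting type. If inference is still a concern, a hypothetical explicit schema can be supplied up front; `agentSchema` below is an assumption matching the JSON shown above, not code from the original answer:)

```scala
import org.apache.spark.sql.types._

// Hypothetical explicit schema for the agent JSON files above.
// Supplying it skips inference, so empty "details" arrays keep
// the array-of-struct type instead of becoming array<string>.
val agentSchema = new StructType()
  .add("age", StringType)
  .add("details", ArrayType(new StructType()
    .add("d1", LongType)
    .add("d2", LongType)))
  .add("name", StringType)

val files = List("path1", "path2", "path3")
val dataframe = spark.read.schema(agentSchema).json(files: _*)
```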