How to access sub-entities in a JSON file?
Question
I have a JSON file that looks like this:
{
"employeeDetails":{
"name": "xxxx",
"num":"415"
},
"work":[
{
"monthYear":"01/2007",
"workdate":"1|2|3|....|31",
"workhours":"8|8|8....|8"
},
{
"monthYear":"02/2007",
"workdate":"1|2|3|....|31",
"workhours":"8|8|8....|8"
}
]
}
I have to get the workdate and workhours values from this JSON data.
I tried it like this:
import org.apache.spark.sql.SparkSession

object JSON2 {
  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession.builder()
        .appName("SQL-JSON")
        .master("local[4]")
        .getOrCreate()
    import spark.implicits._
    val employees = spark.read.json("sample.json")
    employees.printSchema()
    employees.select("employeeDetails").show()
  }
}
I get an exception like this:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`employeeDetails`' given input columns: [_corrupt_record];;
'Project ['employeeDetails]
+- Relation[_corrupt_record#0] json
I am new to Spark.
Recommended answer
given input columns: [_corrupt_record];;
The reason is that Spark supports JSON files in which "each line must contain a separate, self-contained valid JSON object."
Quoting JSON Datasets:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
If a JSON file is not valid for Spark, Spark stores the unparsed text under _corrupt_record (a column name you can change using the columnNameOfCorruptRecord option).
scala> spark.read.json("employee.json").printSchema
root
|-- _corrupt_record: string (nullable = true)
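The columnNameOfCorruptRecord option mentioned above could be used roughly as follows. This is only a sketch: the column name _malformed and the object name are hypothetical choices for illustration, and the file is assumed to be the same multi-line employee.json.

```scala
import org.apache.spark.sql.SparkSession

object CorruptRecordSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("corrupt-record-sketch")
      .master("local[*]")
      .getOrCreate()

    // "_malformed" is a hypothetical name picked for this example;
    // it replaces the default _corrupt_record column name.
    val df = spark.read
      .option("columnNameOfCorruptRecord", "_malformed")
      .json("employee.json")

    // Only the schema is printed here; for a file read line by line,
    // the renamed corrupt-record column is expected to show up in it.
    df.printSchema()

    spark.stop()
  }
}
```

Renaming the column does not make the file parse; it only changes where the unparsed lines are reported.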
And your file is incorrect not only because it is multi-line JSON, but also because jq (a lightweight and flexible command-line JSON processor) says so.
$ cat incorrect.json
{
"employeeDetails":{
"name": "xxxx",
"num:"415"
}
"work":[
{
"monthYear":"01/2007"
"workdate":"1|2|3|....|31",
"workhours":"8|8|8....|8"
},
{
"monthYear":"02/2007"
"workdate":"1|2|3|....|31",
"workhours":"8|8|8....|8"
}
],
}
$ cat incorrect.json | jq
parse error: Expected separator between values at line 4, column 14
Once you fix the JSON file, use the following trick to load the multi-line JSON file.
scala> spark.version
res5: String = 2.1.1
scala> val employees = spark.read.json(sc.wholeTextFiles("employee.json").values)
scala> employees.printSchema
root
|-- employeeDetails: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- num: string (nullable = true)
|-- work: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- monthYear: string (nullable = true)
| | |-- workdate: string (nullable = true)
| | |-- workhours: string (nullable = true)
scala> employees.select("employeeDetails").show()
+---------------+
|employeeDetails|
+---------------+
| [xxxx,415]|
+---------------+
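With the schema shown above, the nested values asked about in the question (workdate, workhours) can be reached with dot notation for struct fields and explode for the work array. The following is a minimal sketch, assuming employee.json has already been fixed; the object name and the alias w are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object NestedFieldsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nested-fields-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same whole-file loading trick as in the answer.
    val employees = spark.read.json(
      spark.sparkContext.wholeTextFiles("employee.json").values)

    // Fields of a struct column are addressed with dot notation.
    employees.select($"employeeDetails.name", $"employeeDetails.num").show()

    // explode produces one row per element of the work array; the
    // element's fields can then be selected through the alias "w".
    val work = employees
      .select(explode($"work").as("w"))
      .select($"w.monthYear", $"w.workdate", $"w.workhours")
    work.show(false)

    spark.stop()
  }
}
```

Each resulting row should correspond to one entry of the work array, with workdate and workhours still stored as pipe-delimited strings.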
Spark >= 2.2
As of Spark 2.2 (released quite recently and highly recommended), you should use the multiLine option instead. The multiLine option was added in SPARK-20980 (Rename the option wholeFile to multiLine for JSON and CSV).
scala> spark.version
res0: String = 2.2.0
scala> spark.read.option("multiLine", true).json("employee.json").printSchema
root
|-- employeeDetails: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- num: string (nullable = true)
|-- work: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- monthYear: string (nullable = true)
| | |-- workdate: string (nullable = true)
| | |-- workhours: string (nullable = true)
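Note that workdate and workhours are plain strings in this schema, so the individual days and hours still have to be split out of the pipe-delimited values. The sketch below shows one way to do that in plain Scala, without Spark; the helper names are hypothetical, the sample values are shortened stand-ins for the elided data in the question, and it assumes every entry is an integer. Inside a DataFrame, the split function from org.apache.spark.sql.functions would play a similar role.

```scala
object WorkLogSketch {
  // Splits a pipe-delimited string such as "1|2|3" into its parts.
  // The pipe is escaped because String.split takes a regular expression.
  def splitPipe(s: String): Seq[String] = s.split("\\|").toSeq

  // Pairs each day with its hours. zip stops at the shorter sequence,
  // so mismatched lengths are silently truncated.
  def dayHours(workdate: String, workhours: String): Seq[(Int, Int)] =
    splitPipe(workdate).zip(splitPipe(workhours)).map {
      case (d, h) => (d.trim.toInt, h.trim.toInt)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample values, not the real data from the question.
    val pairs = dayHours("1|2|3", "8|8|8")
    pairs.foreach { case (d, h) => println(s"day $d: $h hours") }
  }
}
```

Zipping keeps the example short; a stricter version might check that both strings contain the same number of entries before pairing them.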