How to infer schema of JSON files?
Question
I have the following String in Java:
{
  "header": {
    "gtfs_realtime_version": "1.0",
    "incrementality": 0,
    "timestamp": 1528460625,
    "user-data": "metra"
  },
  "entity": [
    {
      "id": "8424",
      "vehicle": {
        "trip": {
          "trip_id": "UP-N_UN314_V1_D",
          "route_id": "UP-N",
          "start_time": "06:17:00",
          "start_date": "20180608",
          "schedule_relationship": 0
        },
        "vehicle": {
          "id": "8424",
          "label": "314"
        },
        "position": {
          "latitude": 42.10085,
          "longitude": -87.72896
        },
        "current_status": 2,
        "timestamp": 1528460601
      }
    }
  ]
}
that represents a JSON document. I want to infer a schema in a Spark DataFrame for a streaming application.
How can I split the fields of the String similarly to a CSV document (where I can call .split(""))?
Recommended answer
Quoting the official documentation, "Schema inference and partition of streaming DataFrames/Datasets":
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
So you can use the spark.sql.streaming.schemaInference configuration property to enable schema inference. I'm not sure whether that is going to work for JSON files.
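As a rough illustration of the property quoted above, the sketch below sets it when building the session and then reads a directory of JSON files as a stream without calling .schema(...). The application name, the local master, and the /path/to/input-dir directory are placeholders, not part of the original answer. Inference relies on files already being present in the directory when the query is defined, and whether the result suits a given JSON layout should be checked rather than assumed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; app name and master are placeholders.
val spark = SparkSession.builder()
  .appName("schema-inference-sketch")
  .master("local[*]")
  // Re-enable schema inference for file-based streaming sources (ad-hoc use only).
  .config("spark.sql.streaming.schemaInference", "true")
  .getOrCreate()

// No explicit .schema(...) here: Spark tries to infer it from the files
// that already exist in the (hypothetical) input directory.
val inferredStream = spark.readStream
  .option("multiLine", true) // each file holds one pretty-printed JSON document
  .json("/path/to/input-dir")

// Inspect what was inferred before building the rest of the query.
inferredStream.printSchema()
```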
What I usually do is load a single file (in a batch query, before starting the streaming query) to infer the schema. That should work in your case. Just do the following.
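// Converting Scala to Java is left as a home exercise.
// `spark` is the SparkSession (predefined in spark-shell).
val jsonSchema = spark
  .read
  .option("multiLine", true) // <-- the trick: parse the whole file as one JSON document
  .json("sample.json")       // batch read of a single sample file
  .schema                    // keep only the inferred StructType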
scala> jsonSchema.printTreeString
root
 |-- entity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |-- current_status: long (nullable = true)
 |    |    |    |-- position: struct (nullable = true)
 |    |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- timestamp: long (nullable = true)
 |    |    |    |-- trip: struct (nullable = true)
 |    |    |    |    |-- route_id: string (nullable = true)
 |    |    |    |    |-- schedule_relationship: long (nullable = true)
 |    |    |    |    |-- start_date: string (nullable = true)
 |    |    |    |    |-- start_time: string (nullable = true)
 |    |    |    |    |-- trip_id: string (nullable = true)
 |    |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- label: string (nullable = true)
 |-- header: struct (nullable = true)
 |    |-- gtfs_realtime_version: string (nullable = true)
 |    |-- incrementality: long (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |    |-- user-data: string (nullable = true)
The trick is to use the multiLine option, so that the entire file is read as a single row from which the schema is inferred. Without it, Spark expects one JSON object per line.
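Once the schema has been inferred in a batch read, it can be handed to the streaming reader. The following is a minimal sketch of that step, not the answerer's own code: the sample.json file, the /path/to/input-dir directory, the console sink, and the feed_ts and e column aliases are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.StructType

// Assumption: a local session; in spark-shell getOrCreate returns the existing one.
val spark = SparkSession.builder()
  .appName("json-schema-streaming-sketch")
  .master("local[*]")
  .getOrCreate()

// Step 1 (batch): infer the schema once from a representative sample file.
val jsonSchema: StructType = spark.read
  .option("multiLine", true)
  .json("sample.json")
  .schema

// Step 2 (streaming): reuse the schema so every micro-batch is parsed consistently.
val feed = spark.readStream
  .schema(jsonSchema)
  .option("multiLine", true)
  .json("/path/to/input-dir") // hypothetical directory watched for new files

// Flatten the nested structure: one output row per element of the entity array.
val positions = feed
  .select(col("header.timestamp").as("feed_ts"), explode(col("entity")).as("e"))
  .select("feed_ts", "e.id", "e.vehicle.position.latitude", "e.vehicle.position.longitude")

// Print each micro-batch to the console; a real application would use another sink.
val query = positions.writeStream
  .format("console")
  .start()

query.awaitTermination()
```

Nested fields are addressed with dot notation rather than by splitting a string, which is the JSON counterpart of the CSV-style split asked about in the question. If the document is only available as an in-memory String, spark.read.json also accepts a Dataset[String] (since Spark 2.2), which may serve for the batch inference step.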