How to infer schema of JSON files?


Question

I have the following String in Java:

{
    "header": {
        "gtfs_realtime_version": "1.0",
        "incrementality": 0,
        "timestamp": 1528460625,
        "user-data": "metra"
    },
    "entity": [{
            "id": "8424",
            "vehicle": {
                "trip": {
                    "trip_id": "UP-N_UN314_V1_D",
                    "route_id": "UP-N",
                    "start_time": "06:17:00",
                    "start_date": "20180608",
                    "schedule_relationship": 0
                },
                "vehicle": {
                    "id": "8424",
                    "label": "314"
                },
                "position": {
                    "latitude": 42.10085,
                    "longitude": -87.72896
                },
                "current_status": 2,
                "timestamp": 1528460601
            }
        }
    ]
}

that represents a JSON document. I want to infer a schema in a Spark DataFrame for a streaming application.


How can I split the fields of the String similarly to a CSV document (where I can call .split(""))?

Answer

Quoting the official documentation, Schema inference and partition of streaming DataFrames/Datasets:


By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.

You could then use the spark.sql.streaming.schemaInference configuration property to enable schema inference. I'm not sure whether that works for JSON files.
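For illustration, a minimal sketch of what enabling that property could look like before starting a stream (this assumes an existing SparkSession named spark, and "input/" is a placeholder directory, not from the original question):

```scala
// Sketch: enable streaming schema inference globally (ad-hoc use cases only).
// Assumes an existing SparkSession called `spark`; "input/" is a placeholder path.
spark.conf.set("spark.sql.streaming.schemaInference", "true")

val stream = spark
  .readStream
  .json("input/") // Spark will now try to infer the schema from files already present
```

Note that inference happens from the files present when the query starts, which is part of why an explicit schema is the safer default.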

What I usually do is load a single file (in a batch query, before starting the streaming query) to infer the schema. That should work in your case. Just do the following:

// I'm leaving converting Scala to Java as a home exercise
val jsonSchema = spark
  .read
  .option("multiLine", true) // <-- the trick
  .json("sample.json")
  .schema
scala> jsonSchema.printTreeString
root
 |-- entity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |-- current_status: long (nullable = true)
 |    |    |    |-- position: struct (nullable = true)
 |    |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- timestamp: long (nullable = true)
 |    |    |    |-- trip: struct (nullable = true)
 |    |    |    |    |-- route_id: string (nullable = true)
 |    |    |    |    |-- schedule_relationship: long (nullable = true)
 |    |    |    |    |-- start_date: string (nullable = true)
 |    |    |    |    |-- start_time: string (nullable = true)
 |    |    |    |    |-- trip_id: string (nullable = true)
 |    |    |    |-- vehicle: struct (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- label: string (nullable = true)
 |-- header: struct (nullable = true)
 |    |-- gtfs_realtime_version: string (nullable = true)
 |    |-- incrementality: long (nullable = true)
 |    |-- timestamp: long (nullable = true)
 |    |-- user-data: string (nullable = true)

The trick is to use the multiLine option, so that the entire file is read as a single row from which the schema is inferred.
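Putting the pieces together, a sketch of reusing the batch-inferred schema in the streaming query (assuming an existing SparkSession named spark; "json-input/" is a placeholder directory, not from the original question):

```scala
// Sketch: infer the schema once in a batch read, then pass it explicitly
// to readStream, as Structured Streaming requires for file-based sources.
// Assumes an existing SparkSession `spark`; "json-input/" is a placeholder path.
val jsonSchema = spark
  .read
  .option("multiLine", true)
  .json("sample.json")
  .schema

val streamingDF = spark
  .readStream
  .schema(jsonSchema)        // explicit schema for the streaming source
  .option("multiLine", true) // needed again if the streamed files are multi-line JSON
  .json("json-input/")
```

This keeps the streaming query's schema stable across restarts while still letting Spark do the inference work once, up front.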
