Config file to define JSON Schema Structure in PySpark
Question
I have created a PySpark application that reads a JSON file into a dataframe through a defined schema. Code sample below:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])

df = sqlContext.read.json(file, schema)
I need a way to define this schema in some kind of config or ini file, and to read it in the main PySpark application.
This would let me modify the schema for a changing JSON, should the need arise in future, without changing the main PySpark code.
Answer
StructType provides the json and jsonValue methods, which can be used to obtain its json and dict representations respectively, and fromJson, which can be used to convert a Python dictionary back into a StructType.
schema = StructType([
    StructField("domain", StringType(), True),
    StructField("timestamp", LongType(), True),
])

StructType.fromJson(schema.jsonValue())
The only thing you need beyond that is the built-in json module to parse the input into a dict that can be consumed by StructType.
For a Scala version, see How to create a schema from CSV file and persist/save that schema to a file?