Pyspark Schema for Json file
Question
I am trying to read a complex JSON file into a Spark DataFrame. Spark infers the schema but mistakes a field for a string when it happens to be an empty array. (I'm not sure why it is inferred as String type when it has to be an array type.) Below is a sample of what I am expecting:
arrayfield:[{"name":"somename"},{"address" : "someadress"}]
The data currently looks like this:
arrayfield:[]
What this does to my code is that whenever I try querying arrayfield.name, it fails. I know I can supply a schema while reading the file, but since the JSON structure is really complex, writing it from scratch doesn't really work out. I tried getting the schema using df.schema (which displays as a StructType) and modifying it to my requirements, but how do I pass a string back as a StructType? This might be really silly, but I am finding it hard to fix. Is there any tool/utility that would help me generate the StructType?
Answer
You need to pass a StructType object to the DataFrame constructor.
Let's say your DF with the mistaken schema, after executing
df.schema
prints out the following:
StructType(List(StructField(data1,StringType,true),StructField(data2,StringType,true)))
So you need to translate this string into an executable script.
Add an import for the types:
from pyspark.sql.types import *
Change List and its parentheses to Python's brackets:
List() -> []
After each type declaration, add parentheses:
StringType -> StringType()
Fix the boolean value strings:
true -> True
Assign it to a variable:
schema = StructType([
    StructField("data1", StringType(), True),
    StructField("data2", StringType(), True)])
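The string edits above can also be automated with a small helper. This is only a sketch that assumes a flat schema like the sample (no nested StructTypes or array types); the variable names are illustrative, not part of the original answer:

```python
import re

# Schema string as printed by df.schema (the sample from above)
printed = ("StructType(List(StructField(data1,StringType,true),"
           "StructField(data2,StringType,true)))")

# Step: List( ... ) -> [ ... ]  (drop "List(" and turn its closing paren,
# the second-to-last character, into "]")
s = printed.replace("List(", "[", 1)
s = s[:-2] + "])"

# Steps: quote field names, add "()" after each type, capitalize booleans
s = re.sub(
    r"StructField\((\w+),(\w+),(true|false)\)",
    lambda m: f'StructField("{m.group(1)}", {m.group(2)}(), '
              f"{m.group(3).capitalize()})",
    s,
)
print(s)
```

The result is valid Python source that matches the hand-written schema above.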
Create a new DF object:
spark.read.json(path, schema=schema)
And you're done.