Pyspark Schema for Json file

Problem description

I am trying to read a complex JSON file into a Spark DataFrame. Spark recognizes the schema, but mistakes a field for a string when it happens to be an empty array. (Not sure why it comes out as String type when it has to be an array type.) Below is a sample of what I am expecting:

arrayfield:[{"name":"somename"},{"address" : "someadress"}]

Right now the data looks like this:

arrayfield:[]

What this does to my code is that whenever I try querying arrayfield.name, it fails. I know I can supply a schema while reading the file, but since the JSON structure is really complex, writing it from scratch doesn't really work out. I tried getting the schema using df.schema (which displays as a StructType) and modifying it as per my requirement, but how do I pass a string back into a StructType? This might be really silly, but I am finding it hard to fix. Is there any tool or utility that would help me generate the StructType?
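
For reference, here is a minimal sketch of the setup described above (the file name people.json is an assumption, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-schema").getOrCreate()

# Let Spark infer the schema from the file.
df = spark.read.json("people.json")

# Inspect what was inferred; per the question, an empty arrayfield:[]
# in the data can end up typed as a plain string here.
df.printSchema()
print(df.schema)  # the StructType(...) string the answer below works from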

Answer

You need to pass a StructType object to the DataFrame constructor.

Let's say that for your DF with the mistaken schema, executing

df.schema

prints output like this:

StructType(List(StructField(data1,StringType,true),StructField(data2,StringType,true)))

So you need to translate this string into an executable script, step by step (a complete sketch follows the list below).

  1. Add an import for the types:

from pyspark.sql.types import *

  2. Change List and its parentheses to Python's brackets:

    List() -> []
    

  3. After each type declaration, add parentheses:

    StringType -> StringType()
    

  4. Fix the boolean value strings:

    true -> True
    

  5. Assign it to a variable:

    schema = StructType([
            StructField("data1", StringType(),True),
            StructField("data2", StringType(),True)])
    

  6. Create a new DF object:

    spark.read.json(path, schema=schema)
    

  7. You're done.
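
Putting the steps together, here is a minimal end-to-end sketch. The file name people.json and the final select are assumptions for illustration; data1 and data2 mirror the printed schema above, and arrayfield is declared as the array of structs the question's data is supposed to contain:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.appName("json-schema").getOrCreate()

schema = StructType([
    StructField("data1", StringType(), True),
    StructField("data2", StringType(), True),
    # Declaring the problem field explicitly means an empty [] in the
    # data no longer degrades it to a plain string.
    StructField("arrayfield", ArrayType(StructType([
        StructField("name", StringType(), True),
        StructField("address", StringType(), True)])), True)])

df = spark.read.json("people.json", schema=schema)
df.select("arrayfield.name").show()  # no longer fails on empty arrays

As for the tool/utility part of the question: rather than hand-editing the printed string, PySpark can also round-trip a schema through its JSON representation, which is easier to patch in code. This is an alternative not covered in the original answer, using StructType.fromJson and the jsonValue method from pyspark.sql.types:

from pyspark.sql.types import StructType

# Let Spark infer the schema once, dump it to a plain dict, patch the
# wrongly inferred field, and rebuild a StructType from the result.
inferred = spark.read.json("people.json").schema
d = inferred.jsonValue()  # plain-dict form of the schema
# ... edit d["fields"] here to replace the wrongly inferred type ...
schema = StructType.fromJson(d)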
