Pyspark Schema for Json file


Question

I am trying to read a complex JSON file into a Spark dataframe. Spark recognizes the schema but mistakes a field as string type when it happens to be an empty array. (Not sure why it is string type when it has to be an array type.) Below is a sample of what I am expecting:

arrayfield:[{"name":"somename"},{"address" : "someadress"}]

Right now the data looks like this:

arrayfield:[]

What this does to my code is that whenever I try querying arrayfield.name it fails. I know I can supply a schema while reading the file, but since the JSON structure is really complex, writing it from scratch doesn't really work out. I tried getting the schema using df.schema (which displays as a StructType) and modifying it as per my requirement, but how do I pass that string back into a StructType? This might be really silly, but I am finding it hard to fix. Are there any tools or utilities which would help me generate the StructType?
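
One way to address the closing question about tooling: a StructType can be serialized to a JSON dict and rebuilt, so the inferred schema can be patched programmatically instead of retyped by hand. A minimal sketch, assuming df is the DataFrame read with the inferred schema and using the field names from the sample above:

    from pyspark.sql.types import StructType

    # Dump the inferred schema to a plain Python dict
    schema_dict = df.schema.jsonValue()

    # Patch the wrongly inferred field in place: declare it as an
    # array of structs instead of a string (names follow the sample)
    for field in schema_dict["fields"]:
        if field["name"] == "arrayfield":
            field["type"] = {
                "type": "array",
                "containsNull": True,
                "elementType": {
                    "type": "struct",
                    "fields": [
                        {"name": "name", "type": "string",
                         "nullable": True, "metadata": {}},
                        {"name": "address", "type": "string",
                         "nullable": True, "metadata": {}},
                    ],
                },
            }

    # Rebuild a StructType that can be passed back to the reader
    schema = StructType.fromJson(schema_dict)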

Answer

You need to pass a StructType object to the DataFrame constructor.

Let's say your DF with the mistaken types, after executing

df.schema

prints output like this:

StructType(List(StructField(data1,StringType,true),StructField(data2,StringType,true)))

So you need to translate this string into an executable script:

  1. Add an import for the types:

from pyspark.sql.types import *

  2. Change List and its parentheses to Python's square brackets:

    List() -> []
    

  3. After each type declaration, add parentheses:

    StringType -> StringType()
    

  4. Fix the boolean value strings:

    true -> True
    

  5. Assign it to a variable:

    schema = StructType([
        StructField("data1", StringType(), True),
        StructField("data2", StringType(), True)])
    

  6. Create a new DF object (since the question reads a JSON file, use the JSON reader):

    spark.read.json(path, schema=schema)
    

  7. And you're done.
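
Putting the steps together for the JSON case in the question, a minimal end-to-end sketch (the field names and path are assumptions taken from the sample above):

    from pyspark.sql.types import StructType, StructField, ArrayType, StringType

    # Declare arrayfield explicitly as an array of structs,
    # so an empty array no longer degrades to a string
    schema = StructType([
        StructField("arrayfield", ArrayType(StructType([
            StructField("name", StringType(), True),
            StructField("address", StringType(), True),
        ])), True),
    ])

    df = spark.read.json(path, schema=schema)

    # Querying the nested field now works even when the array is empty
    df.select("arrayfield.name").show()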
