Json 文件到 pyspark 数据框 [英] Json file to pyspark dataframe

查看:33
本文介绍了Json 文件到 pyspark 数据框的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在 spark (pyspark) 环境中使用 JSON 文件.

问题:无法在 Pyspark Dataframe 中将 JSON 转换为预期格式

第一个输入数据集:

<块引用>

df.printSchema()

root|-- 数据:数组(可为空 = 真)||-- 元素:数组(containsNull = true)|||-- 元素:字符串(containsNull = true)|-- meta: struct (nullable = true)||-- 视图: struct (nullable = true)|||-- 属性:字符串(可为空 = 真)|||-- attributionLink: string (nullable = true)|||-- averageRating: long (nullable = true)|||-- 类别:字符串(可为空 = 真)|||-- 列:数组(可为空 = 真)||||-- 元素: struct (containsNull = true)|||||-- cachedContents: struct (nullable = true)||||||-- 平均值:字符串(可为空 = 真)||||||-- 最大:字符串(可为空 = 真)||||||-- non_null: long (nullable = true)||||||-- null: long (nullable = true)||||||-- 最小:字符串(可为空 = 真)||||||-- sum: string (nullable = true)||||||-- 顶部:数组(可为空 = 真)|||||||-- 元素: struct (containsNull = true)||||||||-- 计数:长(可为空 = 真)||||||||-- 项目:字符串(可为空 = 真)|||||-- dataTypeName: string (nullable = true)|||||-- 描述:字符串(可为空 = 真)|||||-- fieldName: string (nullable = true)|||||-- 标志:数组(可为空 = 真)||||||-- 元素:字符串(containsNull = true)|||||-- 格式: struct (nullable = true)||||||-- 对齐:字符串(可为空 = 真)||||||-- 掩码:字符串(可为空 = 真)||||||-- noCommas: string (nullable = true)||||||-- precisionStyle: string (nullable = true)|||||-- id: long (nullable = true)|||||-- 名称:字符串(可为空 = 真)|||||-- 头寸:多头(可为空 = 真)|||||-- renderTypeName: string (nullable = true)|||||-- tableColumnId: long (nullable = true)|||||-- 宽度:长(可为空 = 真)|||-- createdAt: long (nullable = true)|||-- 描述:字符串(可为空 = 真)|||-- displayType: string (nullable = true)|||-- downloadCount: long (nullable = true)|||-- 标志:数组(可为空 = 真)||||-- 元素:字符串(containsNull = true)|||-- 授予:数组(可为空 = 真)||||-- 元素: struct (containsNull = true)|||||-- 标志:数组(可为空 = 真)||||||-- 元素:字符串(containsNull = true)|||||-- 继承:boolean (nullable = true)|||||-- 类型:字符串(可为空 = 真)|||-- hideFromCatalog: boolean (nullable = true)|||-- hideFromDataJson: boolean (nullable = true)|||-- id: string (nullable = true)|||-- indexUpdatedAt: long (nullable = true)|||-- 元数据: struct (nullable = true)||||-- 附件:数组(可为空 = 真)|||||-- 元素: struct (containsNull = true)||||||-- assetId: string (nullable = true)||||||-- blobId: string (nullable = true)||||||-- 文件名:字符串(可为空 = 真)||||||-- 名称:字符串(可为空 = 真)||||-- availableDisplayTypes: array (nullable = true)|||||-- 元素:字符串(containsNull = true)||||-- custom_fields: struct (nullable = true)|||||-- 附加资源:struct (nullable = true)||||||-- 另请参见:string (nullable = true)|||||-- 数据集信息:struct (nullable = true)||||||-- 代理:字符串(可为空 = 真)|||||-- 数据集摘要:struct (nullable = true)||||||-- 联系信息: string (nullable = true)||||||-- Coverage: string (nullable = true)||||||-- 数据频率:字符串(可为空 = 真)||||||-- 数据集所有者:字符串(可为空 = 真)||||||-- 粒度:字符串(可为空 = 真)||||||-- 组织:字符串(可为空 = 真)||||||-- 发帖频率: string (nullable = true)||||||-- 时间段:字符串(可为空 = 真)||||||-- 单位:字符串(可为空 = 真)|||||-- 免责声明:struct (nullable = true)||||||-- 限制:字符串(可为空 = 真)|||||-- 本地数据: struct (nullable = true)||||||-- 县过滤器:字符串(可为空 = 真)||||||-- County_Column: string (nullable = true)||||-- filterCondition: struct (nullable = true)|||||-- 孩子:数组(可为空=真)||||||-- 元素: struct (containsNull = true)|||||||-- 元数据: struct (nullable = true)||||||||-- includeAuto: long (nullable = true)||||||||-- multiSelect: boolean (nullable = true)||||||||-- 运算符:字符串(可为空 = 真)||||||||-- tableColumnId: struct (nullable = true)|||||||||-- 583607:长(可为空 = 真)|||||||-- 类型:字符串(可为空 = 真)|||||||-- 值:字符串(可为空 = 真)|||||-- 元数据: struct (nullable = true)||||||-- 高级:布尔值(可为空 = 真)||||||-- 统一版本: long (nullable = true)|||||-- 类型:字符串(可为空 = 真)|||||-- 值:字符串(可为空 = 真)||||-- jsonQuery: struct (nullable = true)|||||-- 顺序:数组(可为空 = 真)||||||-- 元素: struct (containsNull = true)|||||||-- 升序:布尔值(可为空 = 真)|||||||-- columnFieldName: string (nullable = true)||||-- rdfSubject: string (nullable = true)||||-- renderTypeConfig: struct (nullable = true)|||||-- 可见: struct (nullable = true)||||||-- 表:布尔值(可为空 = 真)||||-- rowLabel: string (nullable = true)|||-- 名称:字符串(可为空 = 真)|||-- newBackend: boolean (nullable = true)|||-- numberOfComments: long (nullable = true)|||-- oid: long (nullable = true)|||-- 所有者: struct (nullable = true)||||-- displayName: string (nullable = true)||||-- 标志:数组(可为空 = 真)|||||-- 元素:字符串(containsNull = true)||||-- id: string (nullable = true)||||-- profileImageUrlLarge: string (nullable = true)||||-- profileImageUrlMedium: string (nullable = true)||||-- profileImageUrlSmall: string (nullable = true)||||-- 屏幕名称:字符串(可为空 = 真)||||-- 类型:字符串(可为空 = 真)|||-- 出处:字符串(可为空 = 真)|||-- PublicationAppendEnabled: boolean (nullable = true)|||-- PublicationDate: long (nullable = true)|||-- PublicationGroup: long (nullable = true)|||-- PublicationStage: 字符串(可为空 = 真)|||-- 查询:struct (nullable = true)||||-- orderBys: 数组 (nullable = true)|||||-- 元素: struct (containsNull = true)||||||-- 升序:布尔值(可为空 = 真)||||||-- 表达式: struct (nullable = true)|||||||-- columnId: long (nullable = true)|||||||-- 类型:字符串(可为空 = 真)|||-- 权限:数组(可为空 = 真)||||-- 元素:字符串(containsNull = true)|||-- rowsUpdatedAt: long (nullable = true)|||-- rowsUpdatedBy: 字符串 (nullable = true)|||-- tableAuthor: struct (nullable = true)||||-- displayName: string (nullable = true)||||-- 标志:数组(可为空 = 真)|||||-- 元素:字符串(containsNull = true)||||-- id: string (nullable = true)||||-- profileImageUrlLarge: string (nullable = true)||||-- profileImageUrlMedium: string (nullable = true)||||-- profileImageUrlSmall: string (nullable = true)||||-- 屏幕名称:字符串(可为空 = 真)||||-- 类型:字符串(可为空 = 真)|||-- tableId: long (nullable = true)|||-- 标签:数组(可为空 = 真)||||-- 元素:字符串(containsNull = true)|||-- totalTimesRated: long (nullable = true)|||-- viewCount: long (nullable = true)|||-- viewLastModified: long (nullable = true)|||-- viewType: string (nullable = true)

问题: 所有记录都包含在单行两列中,即元数据和数据.同样使用 Spark 原生 JSON 实用程序 - Spark 会自动推断架构(元数据) - 我的期望是它不应该明确作为数据帧上的单独列.

预期产出JSON 数据集具有以下列列表.它应该在数据框中以表格格式显示它们,我可以在其中查询它们

设施、地址、上次检查、违规、总严重违规、总暴击.未更正、非关键违规总数、描述、当地卫生部门、县、设施地址、城市、邮政编码、NYSDOH GAZETTEER (1980)、城市、经营名称、许可证到期日期、许可证/许可、已批准/已批准 (D/BPER)姓名,烫发.经营者姓氏,PERM.经营者姓氏,PERM.操作者姓名、纽约州卫生操作 ID、检查类型、检查意见、食品服务设施州、位置 1

第二个输入数据集:现场首个世界银行资助项目数据集

<块引用>

http://jsonstudio.com/resources/

(现场第一批世界银行资助项目数据集)

一切正常.

df=sqlContext.read.json("/user/train/wb.json")df.count()500

第二个输入数据集可以正常工作,但第一个输入数据集不能.我的观察是两个 Json 文件定义元数据的方式不同.在 1 日.首先定义元数据,然后定义第二个文件中的数据 - 每一行都有数据.

您能否指导我了解第一个输入 JSON 文件格式以及如何在将其转换为 pyspark 数据帧时处理这种情况?

更新结果:经过初步分析,我们发现该格式似乎有误,但社区成员提供了另一种阅读格式的方法.将答案标记为正确并关闭此线程.

如果您需要更多详细信息,请告诉我,提前致谢.

解决方案

databricks

并且列名是分开存放的

column_names = map(lambda x: x['fieldName'], df.select("meta").collect()[0][0][0][4])

I'm trying to work with JSON file on spark (pyspark) environment.

Problem: Unable to convert JSON to expected format in Pyspark Dataframe

1st Input data set:

https://health.data.ny.gov/api/views/cnih-y5dw/rows.json

In this file metadata is defined at start for of the file with tag "meta" and then followed by data with tag "data".

FYI: Steps taken to download data from web to local drive. 1. I've downloaded file to my local drive 2. then pushed to hdfs - from there I'm reading it to spark environment.

df=sqlContext.read.json("/user/train/ny.json",multiLine=True)
df.count()
out[5]: 1
df.show()

df.printSchema()

root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- meta: struct (nullable = true)
 |    |-- view: struct (nullable = true)
 |    |    |-- attribution: string (nullable = true)
 |    |    |-- attributionLink: string (nullable = true)
 |    |    |-- averageRating: long (nullable = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- columns: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- cachedContents: struct (nullable = true)
 |    |    |    |    |    |-- average: string (nullable = true)
 |    |    |    |    |    |-- largest: string (nullable = true)
 |    |    |    |    |    |-- non_null: long (nullable = true)
 |    |    |    |    |    |-- null: long (nullable = true)
 |    |    |    |    |    |-- smallest: string (nullable = true)
 |    |    |    |    |    |-- sum: string (nullable = true)
 |    |    |    |    |    |-- top: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- count: long (nullable = true)
 |    |    |    |    |    |    |    |-- item: string (nullable = true)
 |    |    |    |    |-- dataTypeName: string (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- fieldName: string (nullable = true)
 |    |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- format: struct (nullable = true)
 |    |    |    |    |    |-- align: string (nullable = true)
 |    |    |    |    |    |-- mask: string (nullable = true)
 |    |    |    |    |    |-- noCommas: string (nullable = true)
 |    |    |    |    |    |-- precisionStyle: string (nullable = true)
 |    |    |    |    |-- id: long (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- position: long (nullable = true)
 |    |    |    |    |-- renderTypeName: string (nullable = true)
 |    |    |    |    |-- tableColumnId: long (nullable = true)
 |    |    |    |    |-- width: long (nullable = true)
 |    |    |-- createdAt: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- displayType: string (nullable = true)
 |    |    |-- downloadCount: long (nullable = true)
 |    |    |-- flags: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- grants: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- inherited: boolean (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |-- hideFromCatalog: boolean (nullable = true)
 |    |    |-- hideFromDataJson: boolean (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- indexUpdatedAt: long (nullable = true)
 |    |    |-- metadata: struct (nullable = true)
 |    |    |    |-- attachments: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- assetId: string (nullable = true)
 |    |    |    |    |    |-- blobId: string (nullable = true)
 |    |    |    |    |    |-- filename: string (nullable = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- availableDisplayTypes: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- custom_fields: struct (nullable = true)
 |    |    |    |    |-- Additional Resources: struct (nullable = true)
 |    |    |    |    |    |-- See Also: string (nullable = true)
 |    |    |    |    |-- Dataset Information: struct (nullable = true)
 |    |    |    |    |    |-- Agency: string (nullable = true)
 |    |    |    |    |-- Dataset Summary: struct (nullable = true)
 |    |    |    |    |    |-- Contact Information: string (nullable = true)
 |    |    |    |    |    |-- Coverage: string (nullable = true)
 |    |    |    |    |    |-- Data Frequency: string (nullable = true)
 |    |    |    |    |    |-- Dataset Owner: string (nullable = true)
 |    |    |    |    |    |-- Granularity: string (nullable = true)
 |    |    |    |    |    |-- Organization: string (nullable = true)
 |    |    |    |    |    |-- Posting Frequency: string (nullable = true)
 |    |    |    |    |    |-- Time Period: string (nullable = true)
 |    |    |    |    |    |-- Units: string (nullable = true)
 |    |    |    |    |-- Disclaimers: struct (nullable = true)
 |    |    |    |    |    |-- Limitations: string (nullable = true)
 |    |    |    |    |-- Local Data: struct (nullable = true)
 |    |    |    |    |    |-- County Filter: string (nullable = true)
 |    |    |    |    |    |-- County_Column: string (nullable = true)
 |    |    |    |-- filterCondition: struct (nullable = true)
 |    |    |    |    |-- children: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- metadata: struct (nullable = true)
 |    |    |    |    |    |    |    |-- includeAuto: long (nullable = true)
 |    |    |    |    |    |    |    |-- multiSelect: boolean (nullable = true)
 |    |    |    |    |    |    |    |-- operator: string (nullable = true)
 |    |    |    |    |    |    |    |-- tableColumnId: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- 583607: long (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- metadata: struct (nullable = true)
 |    |    |    |    |    |-- advanced: boolean (nullable = true)
 |    |    |    |    |    |-- unifiedVersion: long (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- jsonQuery: struct (nullable = true)
 |    |    |    |    |-- order: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- ascending: boolean (nullable = true)
 |    |    |    |    |    |    |-- columnFieldName: string (nullable = true)
 |    |    |    |-- rdfSubject: string (nullable = true)
 |    |    |    |-- renderTypeConfig: struct (nullable = true)
 |    |    |    |    |-- visible: struct (nullable = true)
 |    |    |    |    |    |-- table: boolean (nullable = true)
 |    |    |    |-- rowLabel: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- newBackend: boolean (nullable = true)
 |    |    |-- numberOfComments: long (nullable = true)
 |    |    |-- oid: long (nullable = true)
 |    |    |-- owner: struct (nullable = true)
 |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- profileImageUrlLarge: string (nullable = true)
 |    |    |    |-- profileImageUrlMedium: string (nullable = true)
 |    |    |    |-- profileImageUrlSmall: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- provenance: string (nullable = true)
 |    |    |-- publicationAppendEnabled: boolean (nullable = true)
 |    |    |-- publicationDate: long (nullable = true)
 |    |    |-- publicationGroup: long (nullable = true)
 |    |    |-- publicationStage: string (nullable = true)
 |    |    |-- query: struct (nullable = true)
 |    |    |    |-- orderBys: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- ascending: boolean (nullable = true)
 |    |    |    |    |    |-- expression: struct (nullable = true)
 |    |    |    |    |    |    |-- columnId: long (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |-- rights: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- rowsUpdatedAt: long (nullable = true)
 |    |    |-- rowsUpdatedBy: string (nullable = true)
 |    |    |-- tableAuthor: struct (nullable = true)
 |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- profileImageUrlLarge: string (nullable = true)
 |    |    |    |-- profileImageUrlMedium: string (nullable = true)
 |    |    |    |-- profileImageUrlSmall: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- tableId: long (nullable = true)
 |    |    |-- tags: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- totalTimesRated: long (nullable = true)
 |    |    |-- viewCount: long (nullable = true)
 |    |    |-- viewLastModified: long (nullable = true)
 |    |    |-- viewType: string (nullable = true)

Problem: All records getting wrapped up in single row and two column, i.e meta and data. Also with spark native JSON utility - Spark infers the schema (meatadata) automatically - and my expectation is it shouldn't explicitly as separate column on dataframe.

Expected Output JSON data set has following list of columns. It should show them in tabular format in dataframe where I can query them"

FACILITY, ADDRESS, LAST INSPECTED, VIOLATIONS,TOTAL  CRITICAL VIOLATIONS, TOTAL CRIT.  NOT CORRECTED, TOTAL  NONCRITICAL VIOLATIONS, DESCRIPTION, LOCAL HEALTH DEPARTMENT, COUNTY, FACILITY ADDRESS, CITY, ZIP CODE, NYSDOH GAZETTEER (1980), MUNICIPALITY, OPERATION NAME, PERMIT EXPIRATION DATE, PERMITTED  (D/B/A), PERMITTED  CORP. NAME,PERM. OPERATOR LAST NAME, PERM. OPERATOR LAST NAME, PERM. OPERATOR FIRST NAME, NYS HEALTH OPERATION ID, INSPECTION TYPE, INSPECTION COMMENTS, FOOD SERVICE FACILITY STATE, Location1

2nd Input DataSet: On site, it's first data set about funded projects by world bank

http://jsonstudio.com/resources/

(On site, it's first data set about funded projects by world bank)

It works all fine.

df=sqlContext.read.json("/user/train/wb.json")
df.count()
500 

2nd Input Data Sets works all but 1st Input dataset is not. My Observation is the way metadata defined is different for both Json files. In 1st. Meta data is defined first and then data however in 2nd file - meatadate is available with data on every line.

Can you please guide me on 1st input JSON file format and how to handle situation while converting it into pyspark dataframe?

Updated Outcome: After initial analysis we found that format seems wrong but community member helped an alternative way to read format. Marking answer as right and closing this thread.

Let me know if you need any further details, thanks in advance.

解决方案

Check out my notebook on databricks

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/753971180331031/2636709537891264/8469274372247319/latest.html

The first dataset is corrupt, i.e. it's not valid json and so spark can't read it.

But this was for spark 2.2.1

This is especially confusing because of the way this json file is organized The data is stored as a list of lists

df=spark.read.json("rows.json",multiLine=True)
data = df.select("data").collect()[0]['data']

And the column names are stored separately

column_names = map(lambda x: x['fieldName'], df.select("meta").collect()[0][0][0][4])

这篇关于Json 文件到 pyspark 数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆