Json文件到pyspark数据帧 [英] Json file to pyspark dataframe

查看:46
本文介绍了Json文件到pyspark数据帧的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在spark(pyspark)环境中使用JSON文件.

I'm trying to work with JSON file on spark (pyspark) environment.

问题:无法在Pyspark Dataframe中将JSON转换为预期格式

Problem: Unable to convert JSON to expected format in Pyspark Dataframe

第一个输入数据集:

https://health.data.ny.gov /api/views/cnih-y5dw/rows.json

在此文件中,元数据是在文件的开头定义的,其标签为元".然后是带有标签"data"的数据.

In this file metadata is defined at start for of the file with tag "meta" and then followed by data with tag "data".

仅供参考:将数据从Web下载到本地驱动器的步骤. 1.我已将文件下载到本地驱动器2.然后推送到hdfs-从那里我正在将其读取到Spark环境.

FYI: Steps taken to download data from web to local drive. 1. I've downloaded file to my local drive 2. then pushed to hdfs - from there I'm reading it to spark environment.

df=sqlContext.read.json("/user/train/ny.json",multiLine=True)
df.count()
out[5]: 1
df.show()

df.printSchema()

root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- meta: struct (nullable = true)
 |    |-- view: struct (nullable = true)
 |    |    |-- attribution: string (nullable = true)
 |    |    |-- attributionLink: string (nullable = true)
 |    |    |-- averageRating: long (nullable = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- columns: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- cachedContents: struct (nullable = true)
 |    |    |    |    |    |-- average: string (nullable = true)
 |    |    |    |    |    |-- largest: string (nullable = true)
 |    |    |    |    |    |-- non_null: long (nullable = true)
 |    |    |    |    |    |-- null: long (nullable = true)
 |    |    |    |    |    |-- smallest: string (nullable = true)
 |    |    |    |    |    |-- sum: string (nullable = true)
 |    |    |    |    |    |-- top: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- count: long (nullable = true)
 |    |    |    |    |    |    |    |-- item: string (nullable = true)
 |    |    |    |    |-- dataTypeName: string (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- fieldName: string (nullable = true)
 |    |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- format: struct (nullable = true)
 |    |    |    |    |    |-- align: string (nullable = true)
 |    |    |    |    |    |-- mask: string (nullable = true)
 |    |    |    |    |    |-- noCommas: string (nullable = true)
 |    |    |    |    |    |-- precisionStyle: string (nullable = true)
 |    |    |    |    |-- id: long (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- position: long (nullable = true)
 |    |    |    |    |-- renderTypeName: string (nullable = true)
 |    |    |    |    |-- tableColumnId: long (nullable = true)
 |    |    |    |    |-- width: long (nullable = true)
 |    |    |-- createdAt: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- displayType: string (nullable = true)
 |    |    |-- downloadCount: long (nullable = true)
 |    |    |-- flags: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- grants: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- inherited: boolean (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |-- hideFromCatalog: boolean (nullable = true)
 |    |    |-- hideFromDataJson: boolean (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- indexUpdatedAt: long (nullable = true)
 |    |    |-- metadata: struct (nullable = true)
 |    |    |    |-- attachments: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- assetId: string (nullable = true)
 |    |    |    |    |    |-- blobId: string (nullable = true)
 |    |    |    |    |    |-- filename: string (nullable = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- availableDisplayTypes: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- custom_fields: struct (nullable = true)
 |    |    |    |    |-- Additional Resources: struct (nullable = true)
 |    |    |    |    |    |-- See Also: string (nullable = true)
 |    |    |    |    |-- Dataset Information: struct (nullable = true)
 |    |    |    |    |    |-- Agency: string (nullable = true)
 |    |    |    |    |-- Dataset Summary: struct (nullable = true)
 |    |    |    |    |    |-- Contact Information: string (nullable = true)
 |    |    |    |    |    |-- Coverage: string (nullable = true)
 |    |    |    |    |    |-- Data Frequency: string (nullable = true)
 |    |    |    |    |    |-- Dataset Owner: string (nullable = true)
 |    |    |    |    |    |-- Granularity: string (nullable = true)
 |    |    |    |    |    |-- Organization: string (nullable = true)
 |    |    |    |    |    |-- Posting Frequency: string (nullable = true)
 |    |    |    |    |    |-- Time Period: string (nullable = true)
 |    |    |    |    |    |-- Units: string (nullable = true)
 |    |    |    |    |-- Disclaimers: struct (nullable = true)
 |    |    |    |    |    |-- Limitations: string (nullable = true)
 |    |    |    |    |-- Local Data: struct (nullable = true)
 |    |    |    |    |    |-- County Filter: string (nullable = true)
 |    |    |    |    |    |-- County_Column: string (nullable = true)
 |    |    |    |-- filterCondition: struct (nullable = true)
 |    |    |    |    |-- children: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- metadata: struct (nullable = true)
 |    |    |    |    |    |    |    |-- includeAuto: long (nullable = true)
 |    |    |    |    |    |    |    |-- multiSelect: boolean (nullable = true)
 |    |    |    |    |    |    |    |-- operator: string (nullable = true)
 |    |    |    |    |    |    |    |-- tableColumnId: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- 583607: long (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- metadata: struct (nullable = true)
 |    |    |    |    |    |-- advanced: boolean (nullable = true)
 |    |    |    |    |    |-- unifiedVersion: long (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- jsonQuery: struct (nullable = true)
 |    |    |    |    |-- order: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- ascending: boolean (nullable = true)
 |    |    |    |    |    |    |-- columnFieldName: string (nullable = true)
 |    |    |    |-- rdfSubject: string (nullable = true)
 |    |    |    |-- renderTypeConfig: struct (nullable = true)
 |    |    |    |    |-- visible: struct (nullable = true)
 |    |    |    |    |    |-- table: boolean (nullable = true)
 |    |    |    |-- rowLabel: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- newBackend: boolean (nullable = true)
 |    |    |-- numberOfComments: long (nullable = true)
 |    |    |-- oid: long (nullable = true)
 |    |    |-- owner: struct (nullable = true)
 |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- profileImageUrlLarge: string (nullable = true)
 |    |    |    |-- profileImageUrlMedium: string (nullable = true)
 |    |    |    |-- profileImageUrlSmall: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- provenance: string (nullable = true)
 |    |    |-- publicationAppendEnabled: boolean (nullable = true)
 |    |    |-- publicationDate: long (nullable = true)
 |    |    |-- publicationGroup: long (nullable = true)
 |    |    |-- publicationStage: string (nullable = true)
 |    |    |-- query: struct (nullable = true)
 |    |    |    |-- orderBys: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- ascending: boolean (nullable = true)
 |    |    |    |    |    |-- expression: struct (nullable = true)
 |    |    |    |    |    |    |-- columnId: long (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |-- rights: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- rowsUpdatedAt: long (nullable = true)
 |    |    |-- rowsUpdatedBy: string (nullable = true)
 |    |    |-- tableAuthor: struct (nullable = true)
 |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- profileImageUrlLarge: string (nullable = true)
 |    |    |    |-- profileImageUrlMedium: string (nullable = true)
 |    |    |    |-- profileImageUrlSmall: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- tableId: long (nullable = true)
 |    |    |-- tags: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- totalTimesRated: long (nullable = true)
 |    |    |-- viewCount: long (nullable = true)
 |    |    |-- viewLastModified: long (nullable = true)
 |    |    |-- viewType: string (nullable = true)

问题: 所有记录都包装在单行和两列中,即元数据和数据中.借助spark本机JSON实用程序-Spark会自动推断架构(元数据)-我期望它不应该明确地作为dataframe上的单独列.

Problem: All records getting wrapped up in single row and two column, i.e meta and data. Also with spark native JSON utility - Spark infers the schema (meatadata) automatically - and my expectation is it shouldn't explicitly as separate column on dataframe.

预期产量 JSON数据集具有以下列列表.它应该以表格格式在数据框中显示它们,我可以在其中查询它们"

Expected Output JSON data set has following list of columns. It should show them in tabular format in dataframe where I can query them"

FACILITY, ADDRESS, LAST INSPECTED, VIOLATIONS,TOTAL  CRITICAL VIOLATIONS, TOTAL CRIT.  NOT CORRECTED, TOTAL  NONCRITICAL VIOLATIONS, DESCRIPTION, LOCAL HEALTH DEPARTMENT, COUNTY, FACILITY ADDRESS, CITY, ZIP CODE, NYSDOH GAZETTEER (1980), MUNICIPALITY, OPERATION NAME, PERMIT EXPIRATION DATE, PERMITTED  (D/B/A), PERMITTED  CORP. NAME,PERM. OPERATOR LAST NAME, PERM. OPERATOR LAST NAME, PERM. OPERATOR FIRST NAME, NYS HEALTH OPERATION ID, INSPECTION TYPE, INSPECTION COMMENTS, FOOD SERVICE FACILITY STATE, Location1

第二个输入数据集: 在现场,这是有关世界银行资助项目的第一个数据集

2nd Input DataSet: On site, it's first data set about funded projects by world bank

http://jsonstudio.com/resources/

(在现场,这是有关世界银行资助项目的第一个数据集)

(On site, it's first data set about funded projects by world bank)

一切正常.

df=sqlContext.read.json("/user/train/wb.json")
df.count()
500 

第二输入数据集可以工作,但第一输入数据集不能工作.我的观察是两个Json文件定义的元数据不同的方式.在第一.首先定义元数据,然后定义数据,但在第二个文件中-在每行数据中都可以使用Meatadate.

2nd Input Data Sets works all but 1st Input dataset is not. My Observation is the way metadata defined is different for both Json files. In 1st. Meta data is defined first and then data however in 2nd file - meatadate is available with data on every line.

能否请您指导我第一种输入JSON文件格式,以及在将其转换为pyspark数据帧时如何处理情况?

Can you please guide me on 1st input JSON file format and how to handle situation while converting it into pyspark dataframe?

更新后的结果:经过初步分析,我们发现格式似乎有误,但社区成员提供了另一种读取格式的方法.将答案标记为正确并关闭该线程.

Updated Outcome: After initial analysis we found that format seems wrong but community member helped an alternative way to read format. Marking answer as right and closing this thread.

请让我知道是否需要其他详细信息.

Let me know if you need any further details, thanks in advance.

推荐答案

数据砖

https://databricks -prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/753971180331031/2636709537891264/8469274372247319/latest.html

第一个数据集已损坏,即无效json,因此spark无法读取它.

The first dataset is corrupt, i.e. it's not valid json and so spark can't read it.

但这是用于spark 2.2.1

But this was for spark 2.2.1

由于此json文件的组织方式,这尤其令人困惑 数据存储为列表列表

This is especially confusing because of the way this json file is organized The data is stored as a list of lists

df=spark.read.json("rows.json",multiLine=True)
data = df.select("data").collect()[0]['data']

列名分别存储

column_names = map(lambda x: x['fieldName'], df.select("meta").collect()[0][0][0][4])

这篇关于Json文件到pyspark数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆