PySpark: accessing and exploding nested items of a JSON

Views: 516  Posted: 2019/11/26 20:56:07  Tags: python json pyspark

Question

I'm very new to Spark and I'm trying to parse a JSON file containing data to be aggregated, but I can't manage to navigate its content. I searched for other solutions, but I wasn't able to find anything that worked in my case.

This is the schema of the DataFrame created from the imported JSON:

root
  |-- UrbanDataset: struct (nullable = true)
  |    |-- context: struct (nullable = true)
  |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |-- format: string (nullable = true)
  |    |    |    |-- height: long (nullable = true)
  |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |-- longitude: double (nullable = true)
  |    |    |-- language: string (nullable = true)
  |    |    |-- producer: struct (nullable = true)
  |    |    |    |-- id: string (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |-- timeZone: string (nullable = true)
  |    |    |-- timestamp: string (nullable = true)
  |    |-- specification: struct (nullable = true)
  |    |    |-- id: struct (nullable = true)
  |    |    |    |-- schemeID: string (nullable = true)
  |    |    |    |-- value: string (nullable = true)
  |    |    |-- name: string (nullable = true)
  |    |    |-- properties: struct (nullable = true)
  |    |    |    |-- propertyDefinition: array (nullable = true)
  |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |-- codeList: string (nullable = true)
  |    |    |    |    |    |-- dataType: string (nullable = true)
  |    |    |    |    |    |-- propertyDescription: string (nullable = true)
  |    |    |    |    |    |-- propertyName: string (nullable = true)
  |    |    |    |    |    |-- subProperties: struct (nullable = true)
  |    |    |    |    |    |    |-- propertyName: array (nullable = true)
  |    |    |    |    |    |    |    |-- element: string (containsNull = true)
  |    |    |    |    |    |-- unitOfMeasure: string (nullable = true)
  |    |    |-- uri: string (nullable = true)
  |    |    |-- version: string (nullable = true)
  |    |-- values: struct (nullable = true)
  |    |    |-- line: array (nullable = true)
  |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |-- coordinates: struct (nullable = true)
  |    |    |    |    |    |-- format: string (nullable = true)
  |    |    |    |    |    |-- height: double (nullable = true)
  |    |    |    |    |    |-- latitude: double (nullable = true)
  |    |    |    |    |    |-- longitude: double (nullable = true)
  |    |    |    |    |-- id: long (nullable = true)
  |    |    |    |    |-- period: struct (nullable = true)
  |    |    |    |    |    |-- end_ts: string (nullable = true)
  |    |    |    |    |    |-- start_ts: string (nullable = true)
  |    |    |    |    |-- property: array (nullable = true)
  |    |    |    |    |    |-- element: struct (containsNull = true)
  |    |    |    |    |    |    |-- name: string (nullable = true)
  |    |    |    |    |    |    |-- val: string (nullable = true)

A subset of the whole JSON is attached here.

My goal is to retrieve the values struct from this schema and manipulate/aggregate all the val entries located at line.element.property.element.val.

I also tried to explode it to get every field in its own column, "CSV style", but I got this error:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'array(UrbanDataset.context, UrbanDataset.specification, UrbanDataset.values)' due to data type mismatch: input to function array should all be the same type

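import pyspark
import pyspark.sql.functions as psf

# `spark` is the SparkSession (predefined in the pyspark shell)
df = spark.read.format('json').load('data1.json')
# This line raises the AnalysisException shown above
df.select(psf.explode(psf.array("UrbanDataset.*"))).show()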

Thanks.
