Convert a nested JSON to a dataframe in PySpark


Question


I'm trying to create a dataframe from a JSON with nested fields and date fields that I'd like to concatenate:

root
 |-- MODEL: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- START_Time: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- WEIGHT: string (nullable = true)
 |-- REGISTED: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)
 |-- TOTAL: string (nullable = true)
 |-- SCHEDULED: struct (nullable = true)
 |    |-- day: long (nullable = true)
 |    |-- hour: long (nullable = true)
 |    |-- minute: long (nullable = true)
 |    |-- month: long (nullable = true)
 |    |-- second: long (nullable = true)
 |    |-- year: long (nullable = true)
 |-- PACKAGE: string (nullable = true)

objective is to get a result more like :

+---------+------------------+----------+-----------------+---------+-----------------+
|MODEL    |   START_Time     | WEIGHT   |REGISTED         |TOTAL    |SCHEDULED        |   
+---------+------------------+----------+-----------------+---------+-----------------+
|.........| yy-mm-dd-hh-mm-ss| WEIGHT   |yy-mm-dd-hh-mm-ss|TOTAL    |yy-mm-dd-hh-mm-ss| 

where yy-mm-dd-hh-mm-ss is the concatenation of the day, hour, minute, etc. fields in the JSON:

 |-- example: struct (nullable = true)
 |    |-- day: string (nullable = true)
 |    |-- hour: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- month: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- year: string (nullable = true)

I have tried the explode function (maybe I didn't use it as I should), but it didn't work. Can anyone inspire me with a solution? Thank you.

Solution

You can do it in the following simple steps.

  1. Let's say we have the data below in the data.json file:

{"MODEL": "abc", "CODE": "CODE1", "START_Time": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "WEIGHT": "231", "REGISTED": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"}, "TOTAL": "1", "SCHEDULED": {"day": "05", "hour": "08", "minute": "30", "month": "08", "second": "30", "year": "21"},"PACKAGE": "CAR"}

This data has the same schema as you shared.
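Before wiring this into Spark, the expected concatenation can be previewed in plain Python on the same record; the field order (year, month, day, hour, minute, second) matches the output the answer builds below. The helper name `ts` is mine, not part of the answer:

```python
import json

# The sample record from data.json
record = json.loads('''
{"MODEL": "abc", "CODE": "CODE1",
 "START_Time": {"day": "05", "hour": "08", "minute": "30",
                "month": "08", "second": "30", "year": "21"},
 "WEIGHT": "231", "TOTAL": "1", "PACKAGE": "CAR",
 "REGISTED":  {"day": "05", "hour": "08", "minute": "30",
               "month": "08", "second": "30", "year": "21"},
 "SCHEDULED": {"day": "05", "hour": "08", "minute": "30",
               "month": "08", "second": "30", "year": "21"}}
''')

def ts(struct):
    # Join the nested fields in year-month-day-hour-minute-second order
    return '-'.join(struct[k] for k in ('year', 'month', 'day',
                                        'hour', 'minute', 'second'))

print(ts(record['START_Time']))  # 21-08-05-08-30-30
```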

  2. Read this JSON file in PySpark as below.

    from pyspark.sql.functions import *
    
    df = spark.read.json('data.json')
    

  3. Now you can read the nested values and modify the column values as below.

    df.withColumn(
        'START_Time',
        concat(col('START_Time.year'), lit('-'), col('START_Time.month'), lit('-'),
               col('START_Time.day'), lit('-'), col('START_Time.hour'), lit('-'),
               col('START_Time.minute'), lit('-'), col('START_Time.second'))
    ).withColumn(
        'REGISTED',
        concat(col('REGISTED.year'), lit('-'), col('REGISTED.month'), lit('-'),
               col('REGISTED.day'), lit('-'), col('REGISTED.hour'), lit('-'),
               col('REGISTED.minute'), lit('-'), col('REGISTED.second'))
    ).withColumn(
        'SCHEDULED',
        concat(col('SCHEDULED.year'), lit('-'), col('SCHEDULED.month'), lit('-'),
               col('SCHEDULED.day'), lit('-'), col('SCHEDULED.hour'), lit('-'),
               col('SCHEDULED.minute'), lit('-'), col('SCHEDULED.second'))
    ).show()
    

The output would be:

    +-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
    |CODE |MODEL|PACKAGE|REGISTED         |SCHEDULED        |START_Time       |TOTAL|WEIGHT|
    +-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
    |CODE1|abc  |CAR    |21-08-05-08-30-30|21-08-05-08-30-30|21-08-05-08-30-30|1    |231   |
    +-----+-----+-------+-----------------+-----------------+-----------------+-----+------+
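The three nearly identical `concat(..., lit('-'))` chains can also be generated per column. A sketch (the helper name `ts_expr` and the use of `expr` are my own, not from the answer) that builds a `concat_ws` SQL expression for each struct column; `concat_ws` interleaves the separator automatically, so the `lit('-')` arguments disappear:

```python
def ts_expr(field):
    # Build a "concat_ws('-', field.year, ..., field.second)" SQL expression
    parts = ', '.join(f'{field}.{p}' for p in
                      ('year', 'month', 'day', 'hour', 'minute', 'second'))
    return f"concat_ws('-', {parts})"

print(ts_expr('START_Time'))
# With Spark, apply it to each struct column, e.g.:
# from pyspark.sql.functions import expr
# for c in ('START_Time', 'REGISTED', 'SCHEDULED'):
#     df = df.withColumn(c, expr(ts_expr(c)))
```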
