Flatten Json在Pyspark [英] Flatten Json in Pyspark

查看:41
本文介绍了Flatten Json在Pyspark的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

my_data=[
    {'stationCode': 'NB001',
       'summaries': [{'period': {'year': 2017}, 'rainfall': 449},
        {'period': {'year': 2018}, 'rainfall': 352.4},
        {'period': {'year': 2019}, 'rainfall': 253.2},
        {'period': {'year': 2020}, 'rainfall': 283},
        {'period': {'year': 2021}, 'rainfall': 104.2}]},
    {'stationCode': 'NA003',
       'summaries': [{'period': {'year': 2019}, 'rainfall': 58.2},
        {'period': {'year': 2020}, 'rainfall': 628.2},
        {'period': {'year': 2021}, 'rainfall': 120}]}]

在熊猫中,我可以:

import pandas as pd
from pandas import json_normalize
pd.concat([json_normalize(entry, 'summaries', 'stationCode') 
                     for entry in my_data])

这将为我提供下表:

    rainfall  period.year stationCode
0     449.0         2017       NB001
1     352.4         2018       NB001
2     253.2         2019       NB001
3     283.0         2020       NB001
4     104.2         2021       NB001
0      58.2         2019       NA003
1     628.2         2020       NA003
2     120.0         2021       NA003

可以在pyspark中的一行代码中实现吗?

Can this be achieved in one line of code in pyspark?

我尝试了下面的代码,它给我相同的结果.但是,它太长了,有没有办法缩短它?;

I have tried the code below and it gives me the same result. However, it is too long, is there a way to shorten it?;

df=sc.parallelize(my_data)
df1=spark.read.json(df)


  df1.select("stationCode","summaries.period.year","summaries.rainfall").display()
  df1 = df1.withColumn("year_rainfall", F.arrays_zip("year", "rainfall"))
           .withColumn("year_rainfall", F.explode("year_rainfall"))
           .select("stationCode", 
               F.col("year_rainfall.rainfall").alias("Rainfall"), 
               F.col("year_rainfall.year").alias("Year"))
  df1.display(20, False)

将自己介绍给pyspark,因此将非常感谢您提供一些解释或良好的信息来源

Introducing myself to pyspark and so some explanation or good information sources will highly be appreciated

推荐答案

您的内容对我来说很好,而且可读.但是,您也可以直接压缩并爆炸:

What you have looks fine to me and is readable. However you can also zip and explode directly:

out = (df1.select("stationCode", 
      F.explode(F.arrays_zip(*["summaries.period.year","summaries.rainfall"])))
.select("stationCode",F.col("col")['0'].alias("year"),F.col("col")['1'].alias("rainfall")))


out.show()

+-----------+----+--------+
|stationCode|year|rainfall|
+-----------+----+--------+
|      NB001|2017|   449.0|
|      NB001|2018|   352.4|
|      NB001|2019|   253.2|
|      NB001|2020|   283.0|
|      NB001|2021|   104.2|
|      NA003|2019|    58.2|
|      NA003|2020|   628.2|
|      NA003|2021|   120.0|
+-----------+----+--------+

这篇关于Flatten Json在Pyspark的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆