Transform nested dictionary key values to a PySpark dataframe

Question

I have a PySpark dataframe that looks like this:

I would like to extract the nested dictionaries in the "dic" column and transform them into a PySpark dataframe, like this:

Please let me know how I can achieve this.

Thanks!

Recommended answer

from pyspark.sql import functions as F

df.show(truncate=False)  # sample dataframe; truncate=False prints the full JSON string

+---------+----------------------------------------------------------------------------------------------------------+
|timestmap|dic                                                                                                       |
+---------+----------------------------------------------------------------------------------------------------------+
|timestamp|{"Name":"David","Age":"25","Location":"New York","Height":"170","fields":{"Color":"Blue","Shape":"round"}}|
+---------+----------------------------------------------------------------------------------------------------------+

For Spark 2.4+, you can use from_json together with schema_of_json.

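# Infer the schema from the JSON string in the first row
schema = df.select(F.schema_of_json(df.select("dic").first()[0])).first()[0]

# Parse the JSON, expand the top-level keys, then expand the nested "fields" struct
df.withColumn("dic", F.from_json("dic", schema))\
  .selectExpr("dic.*").selectExpr("*","fields.*").drop("fields").show()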

#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25|   170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+
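
To make the parse-and-expand steps concrete, here is a minimal plain-Python sketch (no Spark involved) of what happens to a single row. The helper name flatten_row and its nested_key parameter are hypothetical, introduced only for illustration; the sample string is the one from the dataframe above.

```python
import json


def flatten_row(json_str, nested_key="fields"):
    """Parse one JSON string and lift the keys of `nested_key` to the top level.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    record = json.loads(json_str)
    # Remove the nested dictionary from the record (empty dict if it is absent).
    nested = record.pop(nested_key, None) or {}
    # Top-level keys first, then the nested keys, mirroring
    # selectExpr("*", "fields.*").drop("fields") in the Spark version.
    return {**record, **nested}


sample = (
    '{"Name":"David","Age":"25","Location":"New York","Height":"170",'
    '"fields":{"Color":"Blue","Shape":"round"}}'
)
flat = flatten_row(sample)
print(flat)
```

The resulting dictionary has the same six keys as the columns in the output above. Python keeps the keys in the order they appear in the JSON string, while the Spark output above lists the top-level columns alphabetically.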

If you don't have Spark 2.4, you can instead go through an RDD and use read.json. Note that converting the dataframe to an RDD incurs a performance hit.

df1 = spark.read.json(df.rdd.map(lambda r: r.dic))

df1.select(*[x for x in df1.columns if x!='fields'], F.col("fields.*")).show()

#+---+------+--------+-----+-----+-----+
#|Age|Height|Location| Name|Color|Shape|
#+---+------+--------+-----+-----+-----+
#| 25|   170|New York|David| Blue|round|
#+---+------+--------+-----+-----+-----+
