Create a dataframe from a column of dictionaries in pyspark


Problem description

I want to create a new dataframe from an existing dataframe in pyspark. The dataframe "df" contains a column named "data" whose rows hold dictionaries, and the column's type in the schema is string. The keys of each dictionary are not fixed. For example, name and address are the keys of the first row's dictionary, but other rows may have different keys. Here is an example:

........................................................
data
........................................................
{"name": "sam", "address": "uk"}
........................................................
{"name": "jack", "address": "aus", "occupation": "job"}
........................................................

How do I convert this into a dataframe with individual columns, like the following?

 name   address    occupation
 sam       uk       
 jack      aus       job

Recommended answer

Convert data to an RDD, then use spark.read.json to convert the RDD into a DataFrame. The schema is inferred from the records, so every key that appears in any row becomes a column, and rows that lack a key get null there.
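To make the handling of non-fixed keys concrete, here is a plain-Python sketch of the idea: collect the union of keys across all records, then fill the gaps. This is only an illustration and not how Spark implements schema inference; expand_json_rows is a hypothetical helper name, and sorting the columns alphabetically is chosen here to match the column order in the output shown in this answer.

```python
import json


def expand_json_rows(json_strings, fill=""):
    """Turn JSON object strings with varying keys into a rectangular table."""
    parsed = [json.loads(s) for s in json_strings]
    # The column set is the union of keys seen in any record.
    columns = sorted({key for record in parsed for key in record})
    # Records that lack a key get the fill value in that column.
    rows = [[record.get(col, fill) for col in columns] for record in parsed]
    return columns, rows


columns, rows = expand_json_rows([
    '{"name": "sam", "address": "uk"}',
    '{"name": "jack", "address": "aus", "occupation": "job"}',
])
print(columns)
for row in rows:
    print(row)
```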

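from pyspark.sql import SparkSession

data = [
    {"name": "sam", "address":"uk"}, 
    {"name":"jack" , "address":"aus", "occupation":"job"}
]

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # the SparkContext used to build the RDD

# Infer the schema from all records, then replace nulls with empty strings.
df = spark.read.json(sc.parallelize(data)).na.fill('') 
df.show()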
+-------+----+----------+
|address|name|occupation|
+-------+----+----------+
|     uk| sam|          |
|    aus|jack|       job|
+-------+----+----------+
