pyspark dataframe add a column if it doesn't exist


Question

I have JSON data in various JSON files, and the keys can differ from line to line, for example:

{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}
{"a":1 , "b":"abc2", "d":"abc"}
{"a":1 ,"b":"abc", "c":"abc2", "d":"abc3"}

I want to aggregate data on columns 'b', 'c', 'd' and 'f'. Column 'f' is not present in the given JSON file but could be present in other files, so when it is missing we can use an empty string for it.

I am reading the input file and aggregating the data like this:

import pyspark.sql.functions as f
df = spark.read.json(inputfile)
df2 = df.groupby("b", "c", "d", "f").agg(f.sum(df["a"]))

This is the final output I want:

{"a":2 , "b":"abc", "c":"abc2", "d":"abc3","f":"" }
{"a":1 , "b":"abc2", "c":"" ,"d":"abc","f":""}

Can anyone please help? Thanks in advance!

Recommended answer

You can check whether the column is present in the dataframe and modify df only if necessary:

if 'f' not in df.columns:
    df = df.withColumn('f', f.lit(''))
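The same check can be generalized to several grouping columns at once. The following is a minimal sketch, not part of the original answer: the helper names `missing_columns` and `ensure_columns` are hypothetical, and the `fillna` step is an added assumption, based on the desired output above, that keys missing from only some rows (such as 'c') should also become empty strings rather than nulls.

```python
def missing_columns(existing, required):
    """Return the names in `required` that are absent from `existing`, in order."""
    present = set(existing)
    return [name for name in required if name not in present]


def ensure_columns(df, required, default=""):
    """Add each missing column as a literal `default`, then replace nulls
    in the required string columns with the same default.

    Hypothetical helper; assumes `df` is a pyspark.sql.DataFrame.
    """
    # Imported lazily so missing_columns() can be used without Spark installed.
    from pyspark.sql import functions as F

    for name in missing_columns(df.columns, required):
        df = df.withColumn(name, F.lit(default))
    # Rows that lacked a key present elsewhere in the file are read as null;
    # with a string default, fillna only touches string columns in `subset`.
    return df.fillna(default, subset=list(required))
```

With this in place, the question's aggregation could be written as `df = ensure_columns(spark.read.json(inputfile), ["b", "c", "d", "f"])` followed by `df.groupBy("b", "c", "d", "f").agg(f.sum("a").alias("a"))`, where `spark` and `inputfile` are the names used in the question.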

For nested schemas, you may need to use df.schema, as shown below:

>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 |    |-- b: long (nullable = true)

>>> 'b' in df.schema['a'].dataType.names
True
>>> 'x' in df.schema['a'].dataType.names
False
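For deeper nesting, the lookup above can be repeated one level at a time. The sketch below is an illustration under stated assumptions: `has_nested_field` is a hypothetical helper, it treats the path as dot-separated struct field names (so field names that themselves contain dots are not supported), and it does not descend into arrays or maps.

```python
def has_nested_field(schema, path):
    """Return True if the dotted `path` (e.g. "a.b") names a field in `schema`.

    `schema` is expected to behave like pyspark's StructType: it exposes
    `names` and supports lookup by field name, returning an object with
    a `dataType` attribute.
    """
    current = schema
    for name in path.split("."):
        # Only struct types expose `names`; anything else has no sub-fields.
        names = getattr(current, "names", None)
        if names is None or name not in names:
            return False
        current = current[name].dataType
    return True
```

For the schema printed above, `has_nested_field(df.schema, "a.b")` would correspond to the `'b' in df.schema['a'].dataType.names` check, and `has_nested_field(df.schema, "a.x")` to the second one.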
