pyspark dataframe add a column if it doesn't exist

Question

I have JSON data in various JSON files, and the keys can differ from line to line, e.g.

{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}
{"a":1 , "b":"abc2", "d":"abc"}
{"a":1 ,"b":"abc", "c":"abc2", "d":"abc3"}

I want to aggregate data on columns 'b', 'c', 'd' and 'f'. Column 'f' is not present in the given JSON file but could be present in other files, so when it is missing we can use an empty string for that column.

I am reading the input file and aggregating the data like this

import pyspark.sql.functions as f

df = spark.read.json(inputfile)
df2 = df.groupby("b", "c", "d", "f").agg(f.sum(df["a"]))

This is the final output I want

{"a":2 , "b":"abc", "c":"abc2", "d":"abc3","f":"" }
{"a":1 , "b":"abc2", "c":"" ,"d":"abc","f":""}

Can anyone please help? Thanks in advance!

Solution

You can check whether the column is available in the DataFrame and modify df only if necessary:

if 'f' not in df.columns:
    df = df.withColumn('f', f.lit(''))
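The same check generalizes to any list of columns the aggregation expects. A minimal sketch (the names `required` and `missing_columns` are illustrative, not from the original answer):

```python
# Columns the aggregation expects; any that are absent from the
# DataFrame get added as empty-string columns before the groupby.
required = ["b", "c", "d", "f"]

def missing_columns(required, existing):
    """Return the required column names not present in `existing`."""
    return [c for c in required if c not in existing]

# With a real PySpark DataFrame `df` (sketch, not run here):
# import pyspark.sql.functions as f
# for c in missing_columns(required, df.columns):
#     df = df.withColumn(c, f.lit(''))
# df = df.fillna('')  # rows where an existing column is null also become ''
```

Note the final `fillna('')` step: to get the desired output shown in the question, rows where an existing column like 'c' holds null should also group under the empty string rather than under null.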

For nested schemas you may need to use df.schema like below:

>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 |    |-- b: long (nullable = true)

>>> 'b' in df.schema['a'].dataType.names
True
>>> 'x' in df.schema['a'].dataType.names
False
