pyspark dataframe add a column if it doesn't exist
I have JSON data in various JSON files, and the keys can differ between lines, e.g.
{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}
{"a":1 , "b":"abc2", "d":"abc"}
{"a":1 ,"b":"abc", "c":"abc2", "d":"abc3"}
I want to aggregate data on columns 'b', 'c', 'd' and 'f'; column 'f' is not present in the given JSON file but could be present in other files. So, since column 'f' is not present, we can use an empty string for that column.
I am reading the input file and aggregating the data like this:
import pyspark.sql.functions as f
df = spark.read.json(inputfile)
df2 = df.groupby("b", "c", "d", "f").agg(f.sum(df["a"]))
This is the final output I want:
{"a":2 , "b":"abc", "c":"abc2", "d":"abc3","f":"" }
{"a":1 , "b":"abc2", "c":"" ,"d":"abc","f":""}
Can anyone please help? Thanks in advance!
You can check whether the column is available in the DataFrame and modify df
only if necessary:
if 'f' not in df.columns:
df = df.withColumn('f', f.lit(''))
For nested schemas, you may need to inspect df.schema
instead, like below:
>>> df.printSchema()
root
|-- a: struct (nullable = true)
| |-- b: long (nullable = true)
>>> 'b' in df.schema['a'].dataType.names
True
>>> 'x' in df.schema['a'].dataType.names
False