pyspark dataframe add a column if it doesn't exist
Question
I have JSON data in various JSON files, and the keys can differ between lines, for example:
{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}
{"a":1 , "b":"abc2", "d":"abc"}
{"a":1 ,"b":"abc", "c":"abc2", "d":"abc3"}
I want to aggregate data on columns 'b', 'c', 'd', and 'f'. Column 'f' is not present in this particular JSON file but could be present in other files, so where it is missing we can use an empty string for that column.
I am reading the input file and aggregating the data like this:
import pyspark.sql.functions as f

df = spark.read.json(inputfile)
df2 = df.groupby("b", "c", "d", "f").agg(f.sum(df["a"]))
This is the final output I want:
{"a":2 , "b":"abc", "c":"abc2", "d":"abc3","f":"" }
{"a":1 , "b":"abc2", "c":"" ,"d":"abc","f":""}
Can anyone please help? Thanks in advance!
Answer
You can check whether a column is present in the dataframe and modify df only if necessary:
if 'f' not in df.columns:
    df = df.withColumn('f', f.lit(''))
For nested schemas you may need to use df.schema, like below:
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 |    |-- b: long (nullable = true)
>>> 'b' in df.schema['a'].dataType.names
True
>>> 'x' in df.schema['a'].dataType.names
False