pyspark dataframe add a column if it doesn't exist
Question
I have JSON data in various JSON files, and the keys can differ between lines, for example:
{"a":1 , "b":"abc", "c":"abc2", "d":"abc3"}
{"a":1 , "b":"abc2", "d":"abc"}
{"a":1 ,"b":"abc", "c":"abc2", "d":"abc3"}
I want to aggregate data on columns 'b', 'c', 'd', and 'f'. Column 'f' is not present in this particular JSON file but could be present in other files, so where it is missing we can use an empty string for that column.
I am reading the input file and aggregating the data like this:
import pyspark.sql.functions as f

df = spark.read.json(inputfile)
df2 = df.groupby("b", "c", "d", "f").agg(f.sum(df["a"]))
This is the final output I want:
{"a":2 , "b":"abc", "c":"abc2", "d":"abc3","f":"" }
{"a":1 , "b":"abc2", "c":"" ,"d":"abc","f":""}
Can anyone please help? Thanks in advance!
Answer
You can check whether a column is present in the dataframe and modify df only if necessary:
if 'f' not in df.columns:
    df = df.withColumn('f', f.lit(''))
For nested schemas you may need to use df.schema, like below:
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 |    |-- b: long (nullable = true)
>>> 'b' in df.schema['a'].dataType.names
True
>>> 'x' in df.schema['a'].dataType.names
False