spark - set null when column does not exist in DataFrame


Problem description

I'm loading many versions of JSON files into a Spark DataFrame. Some of the files contain columns A, B, while others contain A, B, C or A, C.

If I run this command:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT A,B,C FROM table")

after loading several files I get the error "column not exist", because I loaded only files that do not contain column C. How can I set this value to null instead of getting an error?

Recommended answer

The DataFrameReader.json method provides an optional schema argument you can use here. If your schema is complex, the simplest solution is to reuse the one inferred from a file which contains all the fields:

df_complete = spark.read.json("complete_file")
schema = df_complete.schema

df_with_missing = spark.read.json("df_with_missing", schema)
# or
# spark.read.schema(schema).json("df_with_missing")

If you know the schema but for some reason you cannot use the approach above, you have to create it from scratch:

from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([
    StructField("A", LongType(), True), ..., StructField("C", LongType(), True)])

As always, be sure to perform some quality checks after loading your data.

Example (note that all fields are nullable):

from pyspark.sql.types import *

schema = StructType([
    StructField("x1", FloatType()),
    StructField("x2", StructType([
        StructField("y1", DoubleType()),
        StructField("y2", StructType([
            StructField("z1", StringType()),
            StructField("z2", StringType())
        ]))
    ])),
    StructField("x3", StringType()),
    StructField("x4", IntegerType())
])

spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).printSchema()
## root
##  |-- x1: float (nullable = true)
##  |-- x2: struct (nullable = true)
##  |    |-- y1: double (nullable = true)
##  |    |-- y2: struct (nullable = true)
##  |    |    |-- z1: string (nullable = true)
##  |    |    |-- z2: string (nullable = true)
##  |-- x3: string (nullable = true)
##  |-- x4: integer (nullable = true)

spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).first()
## Row(x1=None, x2=None, x3=None, x4=1)

spark.read.json(sc.parallelize(["""{"x3": "foo", "x1": 1.0}"""]), schema).first()
## Row(x1=1.0, x2=None, x3='foo', x4=None)

spark.read.json(sc.parallelize(["""{"x2": {"y2": {"z2": "bar"}}}"""]), schema).first()
## Row(x1=None, x2=Row(y1=None, y2=Row(z1=None, z2='bar')), x3=None, x4=None)

Important:

This method is applicable only to the JSON source and depends on implementation details. Don't use it for sources like Parquet.
