spark - set null when column does not exist in dataframe


Problem description

I'm loading many versions of JSON files into a Spark DataFrame. Some of the files hold columns A, B and some A, B, C or A, C.

If I run this command:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

df = sqlContext.sql("SELECT A,B,C FROM table")

after loading several files I get the error "column not exist", because I loaded only files that do not hold column C. How can I set this value to null instead of getting an error?

Recommended answer

The DataFrameReader.json method provides an optional schema argument you can use here. If your schema is complex, the simplest solution is to reuse the one inferred from a file which contains all the fields:

df_complete = spark.read.json("complete_file")
schema = df_complete.schema

df_with_missing = spark.read.json("df_with_missing", schema)
# or
# df_with_missing = spark.read.schema(schema).json("df_with_missing")

If you know the schema but for some reason you cannot use the approach above, you have to create it from scratch:

from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([
    StructField("A", LongType(), True), ..., StructField("C", LongType(), True)])

As always, be sure to perform some quality checks after loading your data.

Example (note that all fields are nullable):

from pyspark.sql.types import *

schema = StructType([
    StructField("x1", FloatType()),
    StructField("x2", StructType([
        StructField("y1", DoubleType()),
        StructField("y2", StructType([
            StructField("z1", StringType()),
            StructField("z2", StringType())
        ]))
    ])),
    StructField("x3", StringType()),
    StructField("x4", IntegerType())
])

spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).printSchema()
## root
##  |-- x1: float (nullable = true)
##  |-- x2: struct (nullable = true)
##  |    |-- y1: double (nullable = true)
##  |    |-- y2: struct (nullable = true)
##  |    |    |-- z1: string (nullable = true)
##  |    |    |-- z2: string (nullable = true)
##  |-- x3: string (nullable = true)
##  |-- x4: integer (nullable = true)

spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).first()
## Row(x1=None, x2=None, x3=None, x4=1)

spark.read.json(sc.parallelize(["""{"x3": "foo", "x1": 1.0}"""]), schema).first()
## Row(x1=1.0, x2=None, x3='foo', x4=None)

spark.read.json(sc.parallelize(["""{"x2": {"y2": {"z2": "bar"}}}"""]), schema).first()
## Row(x1=None, x2=Row(y1=None, y2=Row(z1=None, z2='bar')), x3=None, x4=None)

Important:

This method is applicable only to the JSON source and depends on implementation details. Don't use it for sources like Parquet.
