PySpark: StructField(..., ..., False) always returns `nullable=true` instead of `nullable=false`


Problem description

I'm new to PySpark and am facing a strange problem. I'm trying to set some columns to non-nullable while loading a CSV dataset. I can reproduce my case with a very small dataset (test.csv):

col1,col2,col3
11,12,13
21,22,23
31,32,33
41,42,43
51,,53

There is a null value at row 5, column 2, and I don't want that row inside my DF. I set all fields as non-nullable (nullable=false), but I get a schema with all three columns having nullable=true. This happens even though I set all three columns as non-nullable! I'm running the latest available version of Spark, 2.0.1.

Here is the code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

struct = StructType([
    StructField("col1", StringType(), False),
    StructField("col2", StringType(), False),
    StructField("col3", StringType(), False),
])

df = spark.read.load("test.csv", schema=struct, format="csv", header="true")

df.printSchema() returns:

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)

df.show() returns:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  11|  12|  13|
|  21|  22|  23|
|  31|  32|  33|
|  41|  42|  43|
|  51|null|  53|
+----+----+----+

While I was expecting:

root
 |-- col1: string (nullable = false)
 |-- col2: string (nullable = false)
 |-- col3: string (nullable = false)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  11|  12|  13|
|  21|  22|  23|
|  31|  32|  33|
|  41|  42|  43|
+----+----+----+

Answer

While the Spark behavior (the switch from False to True) is confusing, there is nothing fundamentally wrong going on here. The nullable argument is not a constraint, but a reflection of the source and type semantics, which enables certain kinds of optimization.

You state that you want to avoid null values in your data. For this you should use the na.drop method.

df.na.drop()

For other ways of handling nulls, take a look at the DataFrameNaFunctions documentation (exposed through the DataFrame.na property).
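
For illustration, a minimal sketch of a few common DataFrameNaFunctions calls, using the column names from the example above (the "0" fill value is just a hypothetical placeholder):

# Drop any row that contains a null in any column (the default behavior):
df_clean = df.na.drop()

# Drop a row only when col2 in particular is null:
df_clean = df.na.drop(subset=["col2"])

# Alternatively, replace nulls with a sentinel value instead of dropping rows:
df_filled = df.na.fill("0", subset=["col2"])

df_clean.show()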

The CSV format doesn't provide any tools that allow you to specify data constraints, so by definition the reader cannot assume the input is not null, and your data indeed contains nulls.
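
If you still need the schema itself to report nullable=false, one possible workaround (a sketch only; it round-trips through an RDD, which has a cost) is to drop the nulls first and then re-apply the strict schema with createDataFrame:

# Remove rows containing nulls, then re-apply the non-nullable schema:
clean = df.na.drop()
strict = spark.createDataFrame(clean.rdd, struct)
strict.printSchema()  # now prints nullable = false for all three columns

Note that this simply asserts the declared schema over the cleaned data; it is the na.drop call, not the schema, that actually removes the null row.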
