PySpark: StructField(..., ..., False) always returns `nullable=true` instead of `nullable=false`


Question

I'm new to PySpark and am facing a strange problem. I'm trying to set some columns to non-nullable while loading a CSV dataset. I can reproduce my case with a very small dataset (test.csv):

col1,col2,col3
11,12,13
21,22,23
31,32,33
41,42,43
51,,53

There is a null value at row 5, column 2, and I don't want to get that row inside my DF. I set all fields as non-nullable (nullable=false), but I get a schema in which all three columns have nullable=true. This happens even though I set all three columns as non-nullable! I'm running the latest available version of Spark, 2.0.1.

Here is the code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

struct = StructType([
    StructField("col1", StringType(), False),
    StructField("col2", StringType(), False),
    StructField("col3", StringType(), False),
])

df = spark.read.load("test.csv", schema=struct, format="csv", header="true")

df.printSchema() returns:

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)

df.show() returns:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  11|  12|  13|
|  21|  22|  23|
|  31|  32|  33|
|  41|  42|  43|
|  51|null|  53|
+----+----+----+

What I expected instead was:

root
 |-- col1: string (nullable = false)
 |-- col2: string (nullable = false)
 |-- col3: string (nullable = false)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  11|  12|  13|
|  21|  22|  23|
|  31|  32|  33|
|  41|  42|  43|
+----+----+----+

Answer

While Spark's behavior (switching from False to True) is confusing here, there is nothing fundamentally wrong going on. The nullable argument is not a constraint but a reflection of the source and type semantics, which enables certain kinds of optimization.

You state that you want to avoid null values in your data. For that you should use the na.drop method:

df.na.drop()

For other ways of handling nulls, take a look at the DataFrameNaFunctions documentation (exposed via the DataFrame.na property).

The CSV format doesn't provide any mechanism for specifying data constraints, so by definition the reader cannot assume that the input is not null, and your data does indeed contain nulls.
