PySpark: StructField(..., ..., False) always returns `nullable=true` instead of `nullable=false`
Problem description
I'm new to PySpark and am facing a strange problem. I'm trying to set some columns to non-nullable while loading a CSV dataset. I can reproduce my case with a very small dataset (test.csv):
col1,col2,col3
11,12,13
21,22,23
31,32,33
41,42,43
51,,53
There is a null value at row 5, column 2, and I don't want that row in my DataFrame. I set all fields as non-nullable (nullable=false), but I get a schema with all three columns having nullable=true. This happens even if I set all three columns as non-nullable! I'm running the latest available version of Spark, 2.0.1.
The code is as follows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

struct = StructType([StructField("col1", StringType(), False),
                     StructField("col2", StringType(), False),
                     StructField("col3", StringType(), False)])

df = spark.read.load("test.csv", schema=struct, format="csv", header="true")
df.printSchema()
returns:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
and df.show() returns:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 11| 12| 13|
| 21| 22| 23|
| 31| 32| 33|
| 41| 42| 43|
| 51|null| 53|
+----+----+----+
I expect this to be:
root
|-- col1: string (nullable = false)
|-- col2: string (nullable = false)
|-- col3: string (nullable = false)
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 11| 12| 13|
| 21| 22| 23|
| 31| 32| 33|
| 41| 42| 43|
+----+----+----+
Recommended answer
While the Spark behavior (the switch from False to True) is confusing here, there is nothing fundamentally wrong going on. The nullable argument is not a constraint but a reflection of the source and type semantics, which enables certain types of optimization.
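You can confirm this programmatically; a quick sketch, with field names following the schema from the question:

for field in df.schema.fields:
    # The CSV reader ignores the requested nullability and reports its own
    print(field.name, field.nullable)  # prints True for all three columns, matching the output above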
You state that you want to avoid null values in your data. For this you should use the na.drop method:
df.na.drop()
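If you only care about nulls in specific columns, na.drop also accepts a subset argument; a minimal sketch using the columns from the question:

# Drop only the rows that have a null in col2, keeping nulls elsewhere
df_clean = df.na.drop(subset=["col2"])
df_clean.show()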
For other ways of handling nulls, please take a look at the DataFrameNaFunctions (exposed via the DataFrame.na property) documentation.
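For example, na.fill replaces nulls with a default value instead of dropping the row; a sketch where the replacement value "0" is purely illustrative:

# Replace nulls in col2 with a placeholder string instead of dropping the row
df_filled = df.na.fill({"col2": "0"})
df_filled.show()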
The CSV format doesn't provide any tools that allow you to specify data constraints, so by definition the reader cannot assume the input is not null, and your data indeed contains nulls.
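If you really need the printed schema to report nullable = false, a common workaround (a sketch, not something the CSV reader does for you) is to drop the nulls first and then rebuild the DataFrame with the strict schema:

# Drop rows containing nulls, then re-apply the non-nullable schema
clean_df = spark.createDataFrame(df.na.drop().rdd, struct)
clean_df.printSchema()  # the three columns should now report nullable = false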