Can I change the nullability of a column in my Spark dataframe?
Problem description
I have a StructField in a dataframe that is not nullable. A simple example:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
This returns:
[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]
Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found the post "Change nullable property of column in spark dataframe", which suggested a way of doing it, so I adapted the code therein to this:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name', StringType(), True), StructField('age', LongType(), True), StructField('foo', BooleanType(), True)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)
This fails with:
TypeError: StructField(name,StringType,true) is not JSON serializable
I also see this in the stack trace:
raise ValueError("Circular reference detected")
So I'm a bit stuck. Can anyone modify this example in a way that enables me to define a dataframe where column foo is nullable?
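Recommended answer
It seems you missed the StructType(newSchema). createDataFrame expects the schema as a StructType (a plain list is interpreted as a list of column names), so wrap the list of fields in StructType: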
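import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, BooleanType
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(), False).otherwise(True))
# Declare foo with nullable=True, then wrap the list of fields in a StructType
newSchema = [StructField('name', StringType(), True), StructField('age', LongType(), True), StructField('foo', BooleanType(), True)]
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.schema.fields
df2.show()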