Can I change the nullability of a column in my Spark dataframe?
Problem description
I have a StructField in a dataframe that is not nullable. Simple example:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
This returns:
[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]
Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found the post "Change nullable property of column in spark dataframe", which suggested a way of doing it, so I adapted the code therein to this:
import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)
This fails with:
TypeError: StructField(name,StringType,true) is not JSON serializable
I also see this in the stack trace:
raise ValueError("Circular reference detected")
So I'm a bit stuck. Can anyone modify this example so that I can define a dataframe where column foo is nullable?
Recommended answer
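It seems you missed wrapping the list of fields in StructType(newSchema). createDataFrame expects a StructType as the schema; a plain Python list is treated as a list of column names, which is why passing StructField objects directly fails.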
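import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(), False).otherwise(True))
df.schema.fields
# foo is declared with nullable=True here, since the goal is a nullable column.
newSchema = [StructField('name', StringType(), True), StructField('age', LongType(), True), StructField('foo', BooleanType(), True)]
# Wrap the list of fields in a StructType before passing it as the schema.
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()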