Can I change the nullability of a column in my Spark dataframe?

Problem description

I have a StructField in a dataframe that is not nullable. A simple example:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields

This returns:

[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]

Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found the post "Change nullable property of column in spark dataframe", which suggested a way of doing it, so I adapted its code to this:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)

This fails with:

TypeError: StructField(name,StringType,true) is not JSON serializable

I also see this in the stack trace:

raise ValueError("Circular reference detected")

So I'm a bit stuck. Can anyone modify this example so that I can define a dataframe where column foo is nullable?

Recommended answer

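It seems you missed wrapping the list of fields in StructType(newSchema). As far as I know, createDataFrame expects the schema as a StructType (or a list of column names), so a bare list of StructField objects is not read as a schema: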

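import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, BooleanType

l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(), False).otherwise(True))
# foo is declared with nullable=True here, since the goal is a nullable column
newSchema = [StructField('name', StringType(), True), StructField('age', LongType(), True), StructField('foo', BooleanType(), True)]
# Wrap the list of fields in StructType before passing it as the schema
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()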
