Can I change the nullability of a column in my Spark dataframe?

Question

My dataframe has a StructField that is not nullable. A simple example:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields

This returns:

[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]

Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found the post "Change nullable property of column in spark dataframe", which suggested a way of doing this, so I adapted its code as follows:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)

This fails with:

TypeError: StructField(name,StringType,true) is not JSON serializable

I also see this in the stack trace:

raise ValueError("Circular reference detected")

So I'm a bit stuck. Can anyone modify this example so that I can define a dataframe in which column foo is nullable?

Recommended answer

It seems you are missing StructType(newSchema): the list of fields has to be wrapped in a StructType before it is passed as the schema.

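import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, BooleanType

l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(), False).otherwise(True))
# nullable=True on 'foo', since the goal is to make that column nullable
newSchema = [StructField('name', StringType(), True), StructField('age', LongType(), True), StructField('foo', BooleanType(), True)]
# Wrap the list of fields in a StructType before passing it as the schema
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()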
