How do I add a column to a nested struct in a pyspark dataframe?
Question
I have a dataframe with a schema like
root
|-- state: struct (nullable = true)
| |-- fld: integer (nullable = true)
and I'd like to add columns within the state struct, that is, create a dataframe with a schema like
root
|-- state: struct (nullable = true)
| |-- fld: integer (nullable = true)
| |-- a: integer (nullable = true)
but I am getting
root
|-- state: struct (nullable = true)
| |-- fld: integer (nullable = true)
|-- state.a: integer (nullable = true)
This is the result of trying
df.withColumn('state.a', val)
Recommended answer
Here is a way to do it without using a udf:
# create example dataframe
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType, StructField, StructType
data = [
({'fld': 0},)
]
schema = StructType(
[
StructField('state',
StructType(
[StructField('fld', IntegerType())]
)
)
]
)
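# sqlCtx is assumed to be an existing SQLContext; with a SparkSession
# named spark, call spark.createDataFrame(data, schema) instead
df = sqlCtx.createDataFrame(data, schema)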
df.printSchema()
#root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
Now use withColumn() to overwrite the state column with a rebuilt struct, adding the new field using lit() and alias().
val = 1
df_new = df.withColumn(
'state',
f.struct(*[f.col('state')['fld'].alias('fld'), f.lit(val).alias('a')])
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
If you have a lot of fields in the nested struct, you can use a list comprehension, with df.schema["state"].dataType.names supplying the field names. For example:
val = 1
s_fields = df.schema["state"].dataType.names # ['fld']
df_new = df.withColumn(
'state',
f.struct(*([f.col('state')[c].alias(c) for c in s_fields] + [f.lit(val).alias('a')]))
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
References
- I found a way to get the field names from the Struct without naming them manually from this answer.