How do I add a column to a nested struct in a pyspark dataframe?

Posted: 2021/11/14 22:19:11. Tags: apache-spark, pyspark, apache-spark-sql


Problem description

I have a dataframe with a schema like this:

root
 |-- state: struct (nullable = true)
 |    |-- fld: integer (nullable = true)

and I'd like to add columns inside the state struct, that is, create a dataframe with a schema like this:

root
 |-- state: struct (nullable = true)
 |    |-- fld: integer (nullable = true)
 |    |-- a: integer (nullable = true)

But instead I get:

root
 |-- state: struct (nullable = true)
 |    |-- fld: integer (nullable = true)
 |-- state.a: integer (nullable = true)

This is the result of trying:

df.withColumn('state.a', val)

Recommended answer

Here is a way to do it without using a udf:

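# create example dataframe
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType

# the original snippet relied on a pre-existing sqlCtx;
# a SparkSession offers the same createDataFrame() call
spark = SparkSession.builder.getOrCreate()

data = [
    ({'fld': 0},)
]

schema = StructType(
    [
        StructField('state',
            StructType(
                [StructField('fld', IntegerType())]
            )
        )
    ]
)

df = spark.createDataFrame(data, schema)
df.printSchema()
#root
# |-- state: struct (nullable = true)
# |    |-- fld: integer (nullable = true)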

Now use withColumn() to overwrite state with a rebuilt struct, adding the new field using lit() and alias().

val = 1
df_new = df.withColumn(
    'state', 
    f.struct(*[f.col('state')['fld'].alias('fld'), f.lit(val).alias('a')])
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# |    |-- fld: integer (nullable = true)
# |    |-- a: integer (nullable = false)

If you have a lot of fields in the nested struct, you can use a list comprehension, using df.schema["state"].dataType.names to get the field names. For example:

val = 1
s_fields = df.schema["state"].dataType.names # ['fld']
df_new = df.withColumn(
    'state', 
    f.struct(*([f.col('state')[c].alias(c) for c in s_fields] + [f.lit(val).alias('a')]))
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# |    |-- fld: integer (nullable = true)
# |    |-- a: integer (nullable = false)

References

I found a way to get the field names from the struct without naming them manually from this answer.
