Replace value in deep nested schema Spark DataFrame
Question
I am new to PySpark. I am trying to understand how to access a Parquet file with multiple levels of nested structs and arrays. I need to replace some values in a DataFrame (with a nested schema) with null. I have seen this solution, and it works fine with structs, but I'm not sure how it works with arrays.
My schema looks like this:
|-- unitOfMeasure: struct
| |-- raw: struct
| | |-- id: string
| | |-- codingSystemId: string
| | |-- display: string
| |-- standard: struct
| | |-- id: string
| | |-- codingSystemId: string
|-- Id: string
|-- actions: array
| |-- element: struct
| | |-- action: string
| | |-- actionDate: string
| | |-- actor: struct
| | | |-- actorId: string
| | | |-- aliases: array
| | | | |-- element: struct
| | | | | |-- value: string
| | | | | |-- type: string
| | | | | |-- assigningAuthority: string
| | | |-- fullName: string
What I want to do is replace unitOfMeasure.raw.id with null, actions.element.action with null, and actions.element.actor.aliases.element.value with null, keeping the rest of my DataFrame untouched.
Is there any way I can achieve this?
Answer
For array columns, this is a bit more complicated than for struct fields. One option is to explode the array into a new column so that you can access and update the nested structs. After the update, you would have to reconstruct the initial array column.
But I'd prefer using the higher-order function transform, which was introduced in Spark >= 2.4. Here is an example:
Input DF:
|-- actions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action: string (nullable = true)
| | |-- actionDate: string (nullable = true)
| | |-- actor: struct (nullable = true)
| | | |-- actorId: long (nullable = true)
| | | |-- aliases: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- assigningAuthority: string (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- value: string (nullable = true)
| | | |-- fullName: string (nullable = true)
+--------------------------------------------------------------+
|actions |
+--------------------------------------------------------------+
|[[action_name1, 2019-12-08, [2, [[aa, t1, v1]], full_name1]]] |
|[[action_name2, 2019-12-09, [3, [[aaa, t2, v2]], full_name2]]]|
+--------------------------------------------------------------+
We pass a lambda function to transform which selects all the struct fields and replaces actions.action and actions.actor.aliases.value with null.
transform_expr = """transform (actions, x ->
struct(null as action,
x.actionDate as actionDate,
struct(x.actor.actorId as actorId,
transform(x.actor.aliases, y ->
struct(null as value,
y.type as type,
y.assigningAuthority as assigningAuthority)
) as aliases,
x.actor.fullName as fullName
) as actor
))"""
df.withColumn("actions", expr(transform_expr)).show(truncate=False)
Output DF:
+------------------------------------------------+
|actions |
+------------------------------------------------+
|[[, 2019-12-08, [2, [[, t1, aa]], full_name1]]] |
|[[, 2019-12-09, [3, [[, t2, aaa]], full_name2]]]|
+------------------------------------------------+