Replace value in deep nested schema Spark Dataframe

Problem Description

I am new to PySpark. I am trying to understand how to access a parquet file with multiple levels of nested structs and arrays. I need to replace some values in a data frame (with a nested schema) with null. I have seen this solution and it works fine with structs, but I am not sure how it works with arrays.

My schema looks like this:

|-- unitOfMeasure: struct
|    |-- raw: struct
|    |    |-- id: string
|    |    |-- codingSystemId: string
|    |    |-- display: string
|    |-- standard: struct
|    |    |-- id: string
|    |    |-- codingSystemId: string
|-- Id: string
|-- actions: array
|    |-- element: struct
|    |    |-- action: string
|    |    |-- actionDate: string
|    |    |-- actor: struct
|    |    |    |-- actorId: string
|    |    |    |-- aliases: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- value: string
|    |    |    |    |    |-- type: string
|    |    |    |    |    |-- assigningAuthority: string
|    |    |    |-- fullName: string

What I want to do is replace unitOfMeasure.raw.id with null, actions.element.action with null, and actions.element.actor.aliases.element.value with null, keeping the rest of my data frame untouched.

Is there any way I can achieve this?

Recommended Answer

For array columns, it's a bit more complicated than for struct fields. One option is to explode the array into a new column so that you can access and update the nested structs; after the update, you have to reconstruct the initial array column, as in the sketch below.
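A minimal sketch of that explode-and-rebuild approach (my own illustration, not part of the original answer): it assumes the unique key column Id from the schema above so the array can be reassembled per row, and for brevity it only sets actions.action to null. You would then join the result back to the original DataFrame on Id to keep the other columns.

from pyspark.sql import functions as F

# Explode the array, keeping each element's position so the order can be restored.
exploded = df.select("Id", F.posexplode("actions").alias("pos", "elem"))

# Rebuild each element with `action` set to null; the other fields are kept as-is.
updated = exploded.withColumn(
    "elem",
    F.struct(
        F.lit(None).cast("string").alias("action"),
        F.col("elem.actionDate").alias("actionDate"),
        F.col("elem.actor").alias("actor"),
    ),
)

# Collect the elements back into one array per Id, restoring the original order.
rebuilt = (
    updated.groupBy("Id")
    .agg(F.sort_array(F.collect_list(F.struct("pos", "elem"))).alias("tmp"))
    .withColumn("actions", F.col("tmp.elem"))
    .drop("tmp")
)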

But I'd prefer using the higher-order function transform, which was introduced in Spark >= 2.4. Here is an example:

Input DF:

 |-- actions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- action: string (nullable = true)
 |    |    |-- actionDate: string (nullable = true)
 |    |    |-- actor: struct (nullable = true)
 |    |    |    |-- actorId: long (nullable = true)
 |    |    |    |-- aliases: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- assigningAuthority: string (nullable = true)
 |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- fullName: string (nullable = true)

+--------------------------------------------------------------+
|actions                                                       |
+--------------------------------------------------------------+
|[[action_name1, 2019-12-08, [2, [[aa, t1, v1]], full_name1]]] |
|[[action_name2, 2019-12-09, [3, [[aaa, t2, v2]], full_name2]]]|
+--------------------------------------------------------------+
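For reference, a small snippet (my own sketch, not from the original answer) that reproduces this input DF; the literal values are taken from the table above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows mirror the two records shown above; tuples map positionally onto the struct fields.
data = [
    ([("action_name1", "2019-12-08", (2, [("aa", "t1", "v1")], "full_name1"))],),
    ([("action_name2", "2019-12-09", (3, [("aaa", "t2", "v2")], "full_name2"))],),
]

schema = """
    actions array<struct<
        action: string,
        actionDate: string,
        actor: struct<
            actorId: long,
            aliases: array<struct<assigningAuthority: string, type: string, value: string>>,
            fullName: string
        >
    >>
"""

df = spark.createDataFrame(data, schema)
df.printSchema()
df.show(truncate=False)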

We pass a lambda function to transform which selects all the struct fields and replaces actions.action and actions.actor.aliases.value with null.

from pyspark.sql.functions import expr

# Rebuild each element of `actions`, aliasing every struct field explicitly;
# the fields to blank out are replaced with null literals.
transform_expr = """transform (actions, x -> 
                               struct(null as action, 
                                      x.actionDate as actionDate, 
                                      struct(x.actor.actorId as actorId, 
                                             transform(x.actor.aliases, y -> 
                                                       struct(null as value, 
                                                              y.type as type, 
                                                              y.assigningAuthority as assigningAuthority)
                                                       ) as aliases,
                                            x.actor.fullName as fullName
                                      ) as actor
                                ))"""

df.withColumn("actions", expr(transform_expr)).show(truncate=False)

Output DF:

+------------------------------------------------+
|actions                                         |
+------------------------------------------------+
|[[, 2019-12-08, [2, [[, t1, aa]], full_name1]]] |
|[[, 2019-12-09, [3, [[, t2, aaa]], full_name2]]]|
+------------------------------------------------+
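As a side note (my own addition, assuming PySpark 3.1+ where transform is also exposed in the DataFrame API), the same rewrite can be expressed without an SQL string:

from pyspark.sql import functions as F

df.withColumn(
    "actions",
    F.transform(
        "actions",
        lambda x: F.struct(
            F.lit(None).cast("string").alias("action"),             # action -> null
            x["actionDate"].alias("actionDate"),
            F.struct(
                x["actor"]["actorId"].alias("actorId"),
                F.transform(
                    x["actor"]["aliases"],
                    lambda y: F.struct(
                        F.lit(None).cast("string").alias("value"),  # value -> null
                        y["type"].alias("type"),
                        y["assigningAuthority"].alias("assigningAuthority"),
                    ),
                ).alias("aliases"),
                x["actor"]["fullName"].alias("fullName"),
            ).alias("actor"),
        ),
    ),
).show(truncate=False)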
