Replace value in deep nested schema Spark Dataframe


Problem description

I am new to PySpark. I am trying to understand how to access a Parquet file with multiple levels of nested structs and arrays. I need to replace some values in a data frame (with a nested schema) with null. I have seen this solution; it works fine with structs, but I am not sure how it works with arrays.

My schema looks like this:

|-- unitOfMeasure: struct
|    |-- raw: struct
|    |    |-- id: string
|    |    |-- codingSystemId: string
|    |    |-- display: string
|    |-- standard: struct
|    |    |-- id: string
|    |    |-- codingSystemId: string
|-- Id: string
|-- actions: array
|    |-- element: struct
|    |    |-- action: string
|    |    |-- actionDate: string
|    |    |-- actor: struct
|    |    |    |-- actorId: string
|    |    |    |-- aliases: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- value: string
|    |    |    |    |    |-- type: string
|    |    |    |    |    |-- assigningAuthority: string
|    |    |    |-- fullName: string

What I want to do is replace unitOfMeasure.raw.id with null, actions.element.action with null, and actions.element.actor.aliases.element.value with null, while keeping the rest of my data frame untouched.

Is there any way I can achieve this?

Recommended answer

For array columns, it's a bit more complicated than for struct fields. One option is to explode the array so that each element lands in its own row, where you can access and update the nested structs. After the update, you have to reconstruct the initial array column.

But I'd prefer using the higher-order function transform, which was introduced in Spark 2.4. Here is an example:

Input DF:

 |-- actions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- action: string (nullable = true)
 |    |    |-- actionDate: string (nullable = true)
 |    |    |-- actor: struct (nullable = true)
 |    |    |    |-- actorId: long (nullable = true)
 |    |    |    |-- aliases: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- assigningAuthority: string (nullable = true)
 |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- fullName: string (nullable = true)

+--------------------------------------------------------------+
|actions                                                       |
+--------------------------------------------------------------+
|[[action_name1, 2019-12-08, [2, [[aa, t1, v1]], full_name1]]] |
|[[action_name2, 2019-12-09, [3, [[aaa, t2, v2]], full_name2]]]|
+--------------------------------------------------------------+

We pass a lambda function to transform which selects all the struct fields and replaces actions.action and actions.actor.aliases.value with null.

from pyspark.sql.functions import expr

transform_expr = """transform (actions, x -> 
                               struct(null as action, 
                                      x.actionDate as actionDate, 
                                      struct(x.actor.actorId as actorId, 
                                             transform(x.actor.aliases, y -> 
                                                       struct(null as value, 
                                                              y.type as type, 
                                                              y.assigningAuthority as assigningAuthority)
                                                       ) as aliases,
                                            x.actor.fullName as fullName
                                      ) as actor
                                ))"""

df.withColumn("actions", expr(transform_expr)).show(truncate=False)

Output DF (note that the struct inside aliases now follows the field order written in the expression: value, type, assigningAuthority):

+------------------------------------------------+
|actions                                         |
+------------------------------------------------+
|[[, 2019-12-08, [2, [[, t1, aa]], full_name1]]] |
|[[, 2019-12-09, [3, [[, t2, aaa]], full_name2]]]|
+------------------------------------------------+

