Remove nested column in pyspark


Question

I have a pyspark dataframe with a column results. Inside the results column, I want to remove the column "Attributes". The schema of the dataframe is (there are more columns in results, but I have not shown them for convenience because the schema is large):

 |-- results: struct (nullable = true)
 |    |-- l: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- Attributes: struct (nullable = true)
 |    |    |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |    |    |-- Score: struct (nullable = true)
 |    |    |    |    |    |    |    |-- n: string (nullable = true)
 |    |    |    |    |-- OtherInfo: struct (nullable = true)
 |    |    |    |    |    |-- l: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- Name: string (nullable = true)

How can I do this without a udf in pyspark?

One row of the data is:

{
   "results" : {
        "l" : [
            {
              "m":{
                  "Attributes" : {
                      "m" : {
                           "Score" : {"n" : "85"}
                       }
                  },
                   "OtherInfo":{
                      "l" : [
                           {
                             "m" : {
                               "Name" : {"john"}
                             }
                          },
                          {
                             "m" : {
                                "Name" : "Cena"}
                          }
                       ]
                   }
             }
           }   
         ]
    }
}

Answer

To delete a field from a struct type, you have to create a new struct with all of the elements except the one you want to delete from the original struct.

Here, since the field l under results is an array, you can use the transform function (Spark 2.4+) to update all of its struct elements like this:

from pyspark.sql.functions import struct, expr


t_expr = "transform(results.l, x -> struct(struct(x.m.OtherInfo as OtherInfo) as m))"
df = df.withColumn("results", struct(expr(t_expr).alias("l")))

For each element x in the array, we create a new struct that holds only the x.m.OtherInfo field.

df.printSchema()

#root
# |-- results: struct (nullable = false)
# |    |-- l: array (nullable = true)
# |    |    |-- element: struct (containsNull = false)
# |    |    |    |-- m: struct (nullable = false)
# |    |    |    |    |-- OtherInfo: struct (nullable = true)
# |    |    |    |    |    |-- l: array (nullable = true)
# |    |    |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |    |    |-- m: struct (nullable = true)
# |    |    |    |    |    |    |    |    |-- Name: string (nullable = true)
