Remove nested column in pyspark


Problem description


I have a pyspark dataframe with a column results. Inside the results column I want to remove the column "Attributes". The schema of the dataframe is (there are more columns in results, but I have not shown them for convenience because the schema is large):

 |-- results: struct (nullable = true)
 |    |-- l: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- Attributes: struct (nullable = true)
 |    |    |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |    |    |-- Score: struct (nullable = true)
 |    |    |    |    |    |    |    |-- n: string (nullable = true)
 |    |    |    |    |-- OtherInfo: struct (nullable = true)
 |    |    |    |    |    |-- l: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- Name: string (nullable = true)


How to do this without a udf in pyspark?

One row looks like:

{
   "results" : {
        "l" : [
            {
              "m":{
                  "Attributes" : {
                      "m" : {
                           "Score" : {"n" : "85"}
                       }
                  },
                   "OtherInfo":{
                      "l" : [
                           {
                             "m" : {
                               "Name" : "john"
                             }
                          },
                          {
                             "m" : {
                                "Name" : "Cena"
                             }
                          }
                       ]
                   }
             }
           }   
         ]
    }
}

Answer


To delete a field from a struct type you have to create a new struct with all the elements but the one you want to delete from the original struct.


Here, as the field l under results is an array, you could use the transform function (Spark 2.4+) to update all its struct elements like this:

from pyspark.sql.functions import struct, expr


t_expr = "transform(results.l, x -> struct(struct(x.m.OtherInfo as OtherInfo) as m))"
df = df.withColumn("results", struct(expr(t_expr).alias("l")))


For each element x in the array, we create a new struct that holds only the x.m.OtherInfo field.

df.printSchema()

#root
# |-- results: struct (nullable = false)
# |    |-- l: array (nullable = true)
# |    |    |-- element: struct (containsNull = false)
# |    |    |    |-- m: struct (nullable = false)
# |    |    |    |    |-- OtherInfo: struct (nullable = true)
# |    |    |    |    |    |-- l: array (nullable = true)
# |    |    |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |    |    |-- m: struct (nullable = true)
# |    |    |    |    |    |    |    |    |-- Name: string (nullable = true)
