Dropping nested column of Dataframe with PySpark

Question

I'm trying to drop some nested columns from a Spark dataframe using pyspark. I found this answer for Scala, which seems to do exactly what I want, but I'm not familiar with Scala and don't know how to write it in Python.

https://stackoverflow.com/a/39943812/5706548

Any help is much appreciated.

Thanks.

Answer

A method I found using pyspark is to first convert the nested column into json, and then parse the converted json with a new nested schema that has the unwanted columns filtered out.

Suppose I have the following schema and I want to drop d, e and j (a.b.d, a.e, a.h.j) from the dataframe:

root
 |-- a: struct (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- c: long (nullable = true)
 |    |    |-- d: string (nullable = true)
 |    |-- e: struct (nullable = true)
 |    |    |-- f: long (nullable = true)
 |    |    |-- g: string (nullable = true)
 |    |-- h: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- i: string (nullable = true)
 |    |    |    |-- j: string (nullable = true)
 |-- k: string (nullable = true)

I used the following approach:

  1. Create a new schema for a by excluding d, e and j. A quick way to do this is to manually select the fields you want from df.select("a").schema and build a new schema from the selected fields using StructType (a minimal hand-written sketch of this appears after the list). Alternatively, you can do it programmatically by traversing the schema tree and excluding the unwanted fields, something like:

from pyspark.sql.types import ArrayType, StructField, StructType

def exclude_nested_field(schema, unwanted_fields, parent=""):
    # Rebuild a StructType, keeping only the fields whose dotted path
    # (relative to the starting struct) is not listed in unwanted_fields.
    new_schema = []

    for field in schema:
        full_field_name = field.name
        if parent:
            full_field_name = parent + "." + full_field_name

        if full_field_name not in unwanted_fields:
            if isinstance(field.dataType, StructType):
                # Recurse into nested structs.
                inner_schema = exclude_nested_field(field.dataType, unwanted_fields, full_field_name)
                new_schema.append(StructField(field.name, inner_schema))
            elif isinstance(field.dataType, ArrayType) and isinstance(field.dataType.elementType, StructType):
                # Recurse into arrays of structs; arrays of primitive
                # elements are kept unchanged by the else branch below.
                inner_schema = exclude_nested_field(field.dataType.elementType, unwanted_fields, full_field_name)
                new_schema.append(StructField(field.name, ArrayType(inner_schema)))
            else:
                new_schema.append(StructField(field.name, field.dataType))

    return StructType(new_schema)

new_schema = exclude_nested_field(df.schema["a"].dataType, ["b.d", "e", "h.j"])

  2. Convert column a to json: .withColumn("json", F.to_json("a")).drop("a")

  3. Parse the json column back into a struct with the new schema, and drop the intermediate column: .withColumn("a", F.from_json("json", new_schema)).drop("json")
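For the quick manual route in step 1, the pruned schema can simply be written out by hand. A minimal sketch, assuming the example schema above (the type names are read off the printSchema output; this replaces the programmatic exclude_nested_field call):

from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Hand-written schema for column "a" with a.b.d, a.e and a.h.j left out.
new_schema = StructType([
    StructField("b", StructType([
        StructField("c", LongType())
    ])),
    StructField("h", ArrayType(StructType([
        StructField("i", StringType())
    ])))
])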

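Putting all three steps together, here is a self-contained sketch of the whole approach. The single input row is invented to match the example schema, only standard PySpark calls (createDataFrame, to_json, from_json) are used, and exclude_nested_field is the function defined above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One made-up row that matches the example schema.
df = spark.createDataFrame(
    [({"b": {"c": 1, "d": "x"},
       "e": {"f": 2, "g": "y"},
       "h": [{"i": "u", "j": "v"}]},
      "k-value")],
    "a struct<b:struct<c:bigint,d:string>,"
    "e:struct<f:bigint,g:string>,"
    "h:array<struct<i:string,j:string>>>,"
    "k string",
)

# Step 1: build the pruned schema for "a".
new_schema = exclude_nested_field(df.schema["a"].dataType, ["b.d", "e", "h.j"])

# Steps 2 and 3: struct -> json string -> struct parsed with the new schema.
df = (df
      .withColumn("json", F.to_json("a")).drop("a")
      .withColumn("a", F.from_json("json", new_schema)).drop("json"))

df.printSchema()
# root
#  |-- k: string (nullable = true)
#  |-- a: struct (nullable = true)
#  |    |-- b: struct (nullable = true)
#  |    |    |-- c: long (nullable = true)
#  |    |-- h: array (nullable = true)
#  |    |    |-- element: struct (containsNull = true)
#  |    |    |    |-- i: string (nullable = true)

Note that dropping and re-adding a moves it after k, so the top-level column order changes; a final select can restore the original order if it matters.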