Is there a way to collect the names of all fields in a nested schema in pyspark


Problem description

I wish to collect the names of all the fields in a nested schema. The data were imported from a JSON file.

The schema looks like:

root
 |-- column_a: string (nullable = true)
 |-- column_b: string (nullable = true)
 |-- column_c: struct (nullable = true)
 |    |-- nested_a: struct (nullable = true)
 |    |    |-- double_nested_a: string (nullable = true)
 |    |    |-- double_nested_b: string (nullable = true)
 |    |    |-- double_nested_c: string (nullable = true)
 |    |-- nested_b: string (nullable = true)
 |-- column_d: string (nullable = true)
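
For reference, a schema like this comes straight out of spark.read.json. A minimal sketch that reproduces it (the record values here are invented purely for illustration):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical nested record matching the schema above
record = {
    "column_a": "a",
    "column_b": "b",
    "column_c": {
        "nested_a": {"double_nested_a": "x",
                     "double_nested_b": "y",
                     "double_nested_c": "z"},
        "nested_b": "n",
    },
    "column_d": "d",
}

# Read a one-record RDD of JSON strings; Spark infers the nested schema
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(record)]))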

If I use df.schema.fields or df.schema.names, it only prints the names of the top-level columns - none of the nested columns.
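
For the schema above, both calls stop at the top level (output shown as comments):

print(df.schema.names)
# ['column_a', 'column_b', 'column_c', 'column_d']

print([f.name for f in df.schema.fields])
# ['column_a', 'column_b', 'column_c', 'column_d']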

The desired output is a Python list containing all the column names, such as:

['column_a', 'column_b', 'column_c.nested_a.double_nested_a', 'column_c.nested_a.double_nested_b', etc...]

The information exists there if I want to write a custom function - but am I missing a beat? Does a method exist that achieves what I need?

Recommended answer

By default, Spark doesn't have a built-in method that flattens the schema names for us.

Using the code from this post:

from pyspark.sql.types import ArrayType, StructType

def flatten(schema, prefix=None):
    """Recursively collect the dotted names of all leaf fields in a schema."""
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        # Unwrap array element types so arrays of structs are flattened too
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType

        # Recurse into structs, carrying the dotted prefix; otherwise record the leaf
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)

    return fields


df.printSchema()
#root
# |-- column_a: string (nullable = true)
# |-- column_c: struct (nullable = true)
# |    |-- nested_a: struct (nullable = true)
# |    |    |-- double_nested_a: string (nullable = true)
# |    |-- nested_b: string (nullable = true)
# |-- column_d: string (nullable = true)

sch = df.schema

print(flatten(sch))
#['column_a', 'column_c.nested_a.double_nested_a', 'column_c.nested_b', 'column_d']
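
As a follow-up, the dotted names returned by flatten can be passed straight to select to project every leaf field into a flat DataFrame. A small sketch (the underscore aliasing is just one possible naming scheme, used here to avoid ambiguous dotted column names):

from pyspark.sql import functions as F

flat_df = df.select([F.col(name).alias(name.replace('.', '_'))
                     for name in flatten(df.schema)])
flat_df.printSchema()
#root
# |-- column_a: string (nullable = true)
# |-- column_c_nested_a_double_nested_a: string (nullable = true)
# |-- column_c_nested_b: string (nullable = true)
# |-- column_d: string (nullable = true)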
