Dropping nested column of Dataframe with PySpark
Question
I'm trying to drop some nested columns in a Spark dataframe using pyspark. I found this answer for Scala that seems to do exactly what I want, but I'm not familiar with Scala and don't know how to write it in Python.
https://stackoverflow.com/a/39943812/5706548
I'd really appreciate any help.

Thanks,
Answer
A method I found using pyspark is to first convert the nested column into json, then parse the converted json with a new nested schema that has the unwanted columns filtered out.
Suppose I have the following schema and I want to drop d, e and j (a.b.d, a.e, a.h.j) from the dataframe:
root
 |-- a: struct (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- c: long (nullable = true)
 |    |    |-- d: string (nullable = true)
 |    |-- e: struct (nullable = true)
 |    |    |-- f: long (nullable = true)
 |    |    |-- g: string (nullable = true)
 |    |-- h: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- i: string (nullable = true)
 |    |    |    |-- j: string (nullable = true)
 |-- k: string (nullable = true)
I used the following approach:

1. Create a new schema for a by excluding d, e and j. A quick way to do this is to manually select the fields you want from df.select("a").schema and build a new schema from the selected fields using StructType. Or, you can do it programmatically by traversing the schema tree and excluding the unwanted fields, something like:
from pyspark.sql.types import ArrayType, StructField, StructType

def exclude_nested_field(schema, unwanted_fields, parent=""):
    new_schema = []
    for field in schema:
        full_field_name = field.name
        if parent:
            full_field_name = parent + "." + full_field_name
        if full_field_name not in unwanted_fields:
            if isinstance(field.dataType, StructType):
                # Recurse into nested structs
                inner_schema = exclude_nested_field(field.dataType, unwanted_fields, full_field_name)
                new_schema.append(StructField(field.name, inner_schema))
            elif isinstance(field.dataType, ArrayType):
                # Note: this assumes the array elements are structs
                inner_schema = exclude_nested_field(field.dataType.elementType, unwanted_fields, full_field_name)
                new_schema.append(StructField(field.name, ArrayType(inner_schema)))
            else:
                new_schema.append(StructField(field.name, field.dataType))
    return StructType(new_schema)

new_schema = exclude_nested_field(df.schema["a"].dataType, ["b.d", "e", "h.j"])
2. Convert the a column to json and drop the original struct: .withColumn("json", F.to_json("a")).drop("a"). Then parse the json back with the new schema from step 1 using F.from_json, and drop the helper json column.