Remove nested column in pyspark
Question
I have a pyspark dataframe with a column named results. Inside the results column I want to remove the nested column "Attributes". The schema of the dataframe is shown below (results contains more columns, but I have omitted them for convenience because the schema is large):
|-- results: struct (nullable = true)
| |-- l: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- m: struct (nullable = true)
| | | | |-- Attributes: struct (nullable = true)
| | | | | |-- m: struct (nullable = true)
| | | | | | |-- Score: struct (nullable = true)
| | | | | | | |-- n: string (nullable = true)
| | | | |-- OtherInfo: struct (nullable = true)
| | | | | |-- l: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- m: struct (nullable = true)
| | | | | | | | |-- Name: string (nullable = true)
How can I do this in pyspark without a UDF?
One row looks like this:
{
"results" : {
"l" : [
{
"m":{
"Attributes" : {
"m" : {
"Score" : {"n" : "85"}
}
},
"OtherInfo":{
"l" : [
{
"m" : {
"Name" : "john"
}
},
{
"m" : {
"Name" : "Cena"}
}
]
}
}
}
]
}
}
Recommended answer
To delete a field from a struct type, you have to create a new struct that contains every field of the original struct except the one you want to delete.
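As a plain-Python analogy of that idea (this is only an illustration on the sample row from the question, not Spark code, and the helper name drop_key is made up for this sketch), rebuilding a nested structure without one key could look like this:

```python
# Plain-Python analogy only; nothing here touches Spark.
# The row mirrors the sample from the question.
row = {
    "results": {
        "l": [
            {
                "m": {
                    "Attributes": {"m": {"Score": {"n": "85"}}},
                    "OtherInfo": {
                        "l": [{"m": {"Name": "john"}}, {"m": {"Name": "Cena"}}]
                    },
                }
            }
        ]
    }
}


def drop_key(struct, key):
    """Return a new dict with every entry of struct except key (hypothetical helper)."""
    return {k: v for k, v in struct.items() if k != key}


# Rebuild results: keep its other fields, and rebuild each element of l
# so that its inner m no longer contains "Attributes".
new_row = {
    "results": {
        **row["results"],
        "l": [
            {**x, "m": drop_key(x["m"], "Attributes")}
            for x in row["results"]["l"]
        ],
    }
}

print(list(new_row["results"]["l"][0]["m"]))
```

The original row is left unchanged; a new structure is built around the parts that are kept, which is the same shape of operation that the struct has to go through in Spark.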
Here, since the field l under results is an array, you can use the transform function (Spark 2.4+) to update all of its struct elements like this:
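from pyspark.sql.functions import struct, expr
# Rebuild every element of results.l so that its inner struct m keeps only OtherInfo.
t_expr = "transform(results.l, x -> struct(struct(x.m.OtherInfo as OtherInfo) as m))"
# Note: results is rebuilt with l as its only field; any other field of results must be added to this struct explicitly.
df = df.withColumn("results", struct(expr(t_expr).alias("l")))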
For each element x in the array, we create a new struct that holds only the x.m.OtherInfo field.
df.printSchema()
#root
# |-- results: struct (nullable = false)
# | |-- l: array (nullable = true)
# | | |-- element: struct (containsNull = false)
# | | | |-- m: struct (nullable = false)
# | | | | |-- OtherInfo: struct (nullable = true)
# | | | | | |-- l: array (nullable = true)
# | | | | | | |-- element: struct (containsNull = true)
# | | | | | | | |-- m: struct (nullable = true)
# | | | | | | | | |-- Name: string (nullable = true)
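If m holds more fields than OtherInfo, each field to keep has to be listed in the inner struct(...). A minimal sketch of generating that expression string from a list of field names is shown below; the helper name build_transform_expr and its parameters are assumptions made for this illustration, not part of pyspark.

```python
def build_transform_expr(array_path, inner_struct, keep_fields, var="x"):
    """Build a Spark SQL transform() expression string (hypothetical helper).

    Each array element is rebuilt as a struct whose single field
    `inner_struct` contains only the fields named in `keep_fields`.
    """
    kept = ", ".join(
        f"{var}.{inner_struct}.{name} as {name}" for name in keep_fields
    )
    return (
        f"transform({array_path}, {var} -> "
        f"struct(struct({kept}) as {inner_struct}))"
    )


# Reproduces the expression used in the answer.
t_expr = build_transform_expr("results.l", "m", ["OtherInfo"])
print(t_expr)

# With a second, made-up field name, to show how the list grows.
t_expr_two = build_transform_expr("results.l", "m", ["OtherInfo", "SomeOtherField"])
print(t_expr_two)
```

The resulting string can be passed to expr(...) exactly as in the answer. The list of fields to keep could presumably be read from the dataframe schema instead of being typed by hand, for example from df.schema["results"].dataType["l"].dataType.elementType["m"].dataType.names with "Attributes" filtered out.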