Pyspark converting an array of struct into string
Question
I have the following dataframe in Pyspark
+----+-------+-----+
|name|subject|score|
+----+-------+-----+
| Tom| math| 90|
| Tom|physics| 70|
| Amy| math| 95|
+----+-------+-----+
I used pyspark.sql.functions

from pyspark.sql.functions import collect_list, struct
df.groupBy('name').agg(collect_list(struct('subject', 'score')).alias('score_list'))

to get the following dataframe:
+----+--------------------+
|name| score_list|
+----+--------------------+
| Tom|[[math, 90], [phy...|
| Amy| [[math, 95]]|
+----+--------------------+
My question is: how can I transform the last column score_list into a string and dump it into a csv file that looks like
Tom (math, 90) | (physics, 70)
Amy (math, 95)
Thanks a lot for your help.
Update: Here is a similar question, but it's not exactly the same because it goes directly from one string to another string. In my case, I want to first transform the strings into a collect_list<struct> and finally stringify this collect_list<struct>.
Answer
The duplicates I linked don't exactly answer your question, since you're combining multiple columns. Nevertheless, you can modify those solutions to fit your desired output quite easily.
Just replace the struct with concat_ws. Also use concat to add an opening and a closing parenthesis to get the output you desire.
from pyspark.sql.functions import collect_list, concat, concat_ws, lit

df = df.groupBy('name')\
    .agg(
        concat_ws(
            " | ",
            collect_list(
                concat(lit("("), concat_ws(", ", 'subject', 'score'), lit(")"))
            )
        ).alias('score_list')
    )
df.show(truncate=False)
#+----+--------------------------+
#|name|score_list |
#+----+--------------------------+
#|Tom |(math, 90) | (physics, 70)|
#|Amy |(math, 95) |
#+----+--------------------------+
Note that since the comma appears in the score_list column, this value will be quoted when you write to csv if you use the default arguments.
For example:
df.coalesce(1).write.csv("test.csv")
will produce the following output file:
Tom,"(math, 90) | (physics, 70)"
Amy,"(math, 95)"
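The quoting behavior itself is standard CSV rather than anything Spark-specific; a small pure-Python illustration with the stdlib csv module (the rows here just mirror the output above) shows the same rule:

```python
import csv
import io

buf = io.StringIO()
# Default QUOTE_MINIMAL: only fields containing the delimiter (or quotes/newlines) get quoted
writer = csv.writer(buf)
writer.writerow(["Tom", "(math, 90) | (physics, 70)"])
writer.writerow(["Amy", "(math, 95)"])
print(buf.getvalue())
```

Only the second field of each row contains a comma, so only it is wrapped in double quotes.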