Pyspark converting an array of struct into string

Question

I have the following dataframe in Pyspark

+----+-------+-----+                                                            
|name|subject|score|
+----+-------+-----+
| Tom|   math|   90|
| Tom|physics|   70|
| Amy|   math|   95|
+----+-------+-----+

I used pyspark.sql.functions:

from pyspark.sql.functions import collect_list, struct

df.groupBy('name').agg(collect_list(struct('subject', 'score')).alias('score_list'))

to get the following dataframe:

+----+--------------------+
|name|          score_list|
+----+--------------------+
| Tom|[[math, 90], [phy...|
| Amy|        [[math, 95]]|
+----+--------------------+

My question is how I can transform the last column score_list into a string and dump it into a csv file that looks like

Tom     (math, 90) | (physics, 70)
Amy     (math, 95)

Thanks a lot for your help.

Update: Here is a similar question, but it's not exactly the same because it goes directly from one string to another string. In my case, I want to first convert a string into a collect_list<struct> and finally stringify this collect_list<struct>.

Answer

The duplicates I linked don't exactly answer your question, since you're combining multiple columns. Nevertheless, you can modify those solutions to fit your desired output quite easily.

Just replace the struct with concat_ws. Also use concat to add an opening and a closing parenthesis, to get the output you desire.

from pyspark.sql.functions import collect_list, concat, concat_ws, lit

df = df.groupBy('name')\
    .agg(
        concat_ws(
            " | ", 
            collect_list(
                concat(lit("("), concat_ws(", ", 'subject', 'score'), lit(")"))
            )
        ).alias('score_list')
    )
df.show(truncate=False)

#+----+--------------------------+
#|name|score_list                |
#+----+--------------------------+
#|Tom |(math, 90) | (physics, 70)|
#|Amy |(math, 95)                |
#+----+--------------------------+
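As a sanity check, the string logic of the expression above (format each (subject, score) pair with concat, collect the pairs per name with collect_list, then join them with concat_ws(" | ", ...)) can be mimicked in plain Python without a Spark session; the variable names below are illustrative only:

```python
from collections import defaultdict

# Sample rows mirroring the example dataframe
rows = [("Tom", "math", 90), ("Tom", "physics", 70), ("Amy", "math", 95)]

# concat(lit("("), concat_ws(", ", ...), lit(")")) + collect_list:
# format each pair as "(subject, score)" and group the results by name
grouped = defaultdict(list)
for name, subject, score in rows:
    grouped[name].append(f"({subject}, {score})")

# concat_ws(" | ", ...): join each name's collected list with " | "
score_list = {name: " | ".join(pairs) for name, pairs in grouped.items()}
print(score_list["Tom"])   # (math, 90) | (physics, 70)
print(score_list["Amy"])   # (math, 95)
```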

Note that since a comma appears in the score_list column, this value will be quoted when you write to csv if you use the default arguments.

For example:

df.coalesce(1).write.csv("test.csv")

will produce the following output file:

Tom,"(math, 90) | (physics, 70)"
Amy,"(math, 95)"
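The quoting comes from the CSV format itself, not from anything Spark-specific: under minimal quoting, any field containing the delimiter must be wrapped in quotes. Python's stdlib csv module exhibits the same behavior (a small illustration, not Spark's writer):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default QUOTE_MINIMAL: quote only when needed
writer.writerow(["Tom", "(math, 90) | (physics, 70)"])
writer.writerow(["Amy", "(math, 95)"])

# Fields containing the comma delimiter come out wrapped in double quotes
print(buf.getvalue())
```

If you need unquoted output, picking a delimiter that cannot appear in the data (a tab, for instance) avoids quoting under these minimal-quoting rules.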
