PySpark: converting an array of structs into a string


Problem description

I have the following DataFrame in PySpark:

+----+-------+-----+                                                            
|name|subject|score|
+----+-------+-----+
| Tom|   math|   90|
| Tom|physics|   70|
| Amy|   math|   95|
+----+-------+-----+

I used functions from pyspark.sql.functions:

from pyspark.sql.functions import collect_list, struct

df.groupBy('name').agg(collect_list(struct('subject', 'score')).alias('score_list'))

to get the following DataFrame:

+----+--------------------+
|name|          score_list|
+----+--------------------+
| Tom|[[math, 90], [phy...|
| Amy|        [[math, 95]]|
+----+--------------------+

My question is: how can I transform the last column, score_list, into a string and dump it into a CSV file that looks like this?

Tom     (math, 90) | (physics, 70)
Amy     (math, 95)

Thanks for your help.

Update: There is a similar question, but it is not exactly the same, because it goes directly from one string to another string. In my case, I want to first collect the strings into a collect_list<struct> and then stringify that collect_list<struct>.

Recommended answer

The duplicates I linked don't exactly answer your question, since you're combining multiple columns. Nevertheless, you can adapt those solutions to produce your desired output quite easily.

Just replace struct with concat_ws. Also use concat to add the opening and closing parentheses so that each entry is formatted the way you want.

from pyspark.sql.functions import collect_list, concat, concat_ws, lit

df = df.groupBy('name')\
    .agg(
        concat_ws(
            " | ", 
            collect_list(
                concat(lit("("), concat_ws(", ", 'subject', 'score'), lit(")"))
            )
        ).alias('score_list')
    )
df.show(truncate=False)

#+----+--------------------------+
#|name|score_list                |
#+----+--------------------------+
#|Tom |(math, 90) | (physics, 70)|
#|Amy |(math, 95)                |
#+----+--------------------------+

Note that because a comma appears in the score_list column, this value will be quoted when you write to CSV with the default arguments.

For example:

df.coalesce(1).write.csv("test.csv")

will produce an output directory named test.csv whose part file contains:

Tom,"(math, 90) | (physics, 70)"
Amy,"(math, 95)"
