pyspark - merge 2 columns of sets


Question

I have a Spark dataframe that has 2 columns formed from the function collect_set. I would like to combine these 2 columns of sets into 1 column of sets. How should I do so? They are both sets of strings.

For instance, I have 2 columns formed from calling collect_set:

Fruits                    |    Meat
[Apple, Orange, Pear]     |    [Beef, Chicken, Pork]

How do I turn it into:

Food

[Apple, Orange, Pear, Beef, Chicken, Pork]

Thank you very much for your help in advance.

Answer

Assuming df is:

+--------------------+--------------------+
|              Fruits|                Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+
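For reference, a DataFrame of this shape can be built with collect_set. This is a minimal sketch with made-up sample rows, not part of the original post:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw rows; only the shape matters here
raw = spark.createDataFrame(
    [("Apple", "Beef"), ("Orange", "Chicken"), ("Pear", "Pork")],
    ["fruit", "meat"],
)

# collect_set aggregates each column into a set-like array column
df = raw.agg(
    F.collect_set("fruit").alias("Fruits"),
    F.collect_set("meat").alias("Meat"),
)
df.show()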

Then:

import itertools

# Flatten the two array columns of each row into one combined list, then collect
df.rdd.map(lambda x: list(itertools.chain(x.Fruits, x.Meat))).collect()

This combines the Fruits & Meat values into one flat collection per row, i.e.

[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
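If you would rather stay in the DataFrame API than drop down to the RDD, a sketch along these lines should work on Spark 2.4+ (array_union also de-duplicates across the two columns). This is an alternative approach, not part of the original answer:

from pyspark.sql import functions as F

# Merge the two array columns into one de-duplicated array column (Spark >= 2.4)
food_df = df.select(F.array_union("Fruits", "Meat").alias("Food"))
food_df.show(truncate=False)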


Hope this helps!
