pyspark - merge 2 columns of sets


Question

I have a Spark DataFrame with 2 columns produced by the collect_set function. I would like to combine these 2 columns of sets into 1 column containing a single set. How should I do this? Both columns are sets of strings.

For instance, I have 2 columns formed by calling collect_set:

Fruits                  |    Meat
[Apple,Orange,Pear]     |    [Beef, Chicken, Pork]

How do I turn this into:

Food

[Apple,Orange,Pear, Beef, Chicken, Pork]

Thanks in advance for your help.

Recommended answer

Let's say df is:

+--------------------+--------------------+
|              Fruits|                Meat|
+--------------------+--------------------+
|[Pear, Orange, Ap...|[Chicken, Pork, B...|
+--------------------+--------------------+

then

import itertools
# Chain the two array columns of each row into one list, then collect the rows to the driver
df.rdd.map(lambda x: list(itertools.chain(x.Fruits, x.Meat))).collect()

This chains Fruits and Meat into one combined list per row (values that appear in both columns are not deduplicated), i.e.

[[u'Pear', u'Orange', u'Apple', u'Chicken', u'Pork', u'Beef']]
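
Since the question asks for a set, the combined list may also need duplicates removed when the same value can appear in both columns. A minimal plain-Python sketch of that per-row merge follows; `merge_unique` is a hypothetical helper name, not part of PySpark, and the `row` dictionary only stands in for one DataFrame row.

```python
from itertools import chain

def merge_unique(*columns):
    """Concatenate several lists and drop duplicates, keeping first-seen order.

    Hypothetical helper (not a PySpark API). Relies on dicts preserving
    insertion order, which holds on Python 3.7+.
    """
    return list(dict.fromkeys(chain.from_iterable(columns)))

# Stand-in for a single row of the DataFrame from the question.
row = {"Fruits": ["Apple", "Orange", "Pear"], "Meat": ["Beef", "Chicken", "Pork"]}
food = merge_unique(row["Fruits"], row["Meat"])
print(food)  # ['Apple', 'Orange', 'Pear', 'Beef', 'Chicken', 'Pork']
```

Such a helper could be dropped into the map above, for example `df.rdd.map(lambda x: merge_unique(x.Fruits, x.Meat))`. On Spark 2.4 or later, `pyspark.sql.functions.array_union("Fruits", "Meat")` should give a deduplicated union directly on the DataFrame without going through the RDD.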


Hope this helps!
