Combine two RDDs in pyspark


Question

Assuming that I have the following RDDs:

a = sc.parallelize([1, 2, 5, 3])
b = sc.parallelize(['a','c','d','e'])

How do I combine these two RDDs into one RDD that looks like this:

[('a', 1), ('c', 2), ('d', 5), ('e', 3)]

Using a.union(b) just combines them into one list. Any idea?

Answer

You probably just want to b.zip(a) the two RDDs (note the reversed order, since you want to key by b's values).

Just read the Python docs:

zip(other)

Zips this RDD with another one, returning key-value pairs with the first element in each RDD, the second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).

x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
x.zip(y).collect()
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]

