pyspark join rdds by a specific key


Problem description

I have two rdds that I need to join together. They look like the following:

RDD1

[(u'2', u'100', 2),
 (u'1', u'300', 1),
 (u'1', u'200', 1)]

RDD2

[(u'1', u'2'), (u'1', u'3')]

The output I would like is:

[(u'1', u'2', u'100', 2)]

So I would like to select those pairs from RDD2 whose second value matches the first value of a tuple in RDD1. I have tried join and also cartesian, but neither works or even gets close to what I am looking for. I am new to Spark and would appreciate any help from you guys.

Thanks

Answer

It sounds like you are doing this manually. Here is sample code:

rdd = sc.parallelize([(u'2', u'100', 2), (u'1', u'300', 1), (u'1', u'200', 1)])
rdd1 = sc.parallelize([(u'1', u'2'), (u'1', u'3')])

# Key rdd1 by its second element and rdd by its first, then inner-join on that key
newRdd = rdd1.map(lambda x: (x[1], x[0])).join(rdd.map(lambda x: (x[0], (x[1], x[2]))))

# Flatten the joined (key, (left_value, (v1, v2))) structure back into a 4-tuple
newRdd.map(lambda x: (x[1][0], x[0], x[1][1][0], x[1][1][1])).coalesce(1).collect()

Output:

[(u'1', u'2', u'100', 2)]
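To make the re-keying explicit, here is a step-by-step sketch of the same join; the intermediate names (keyed2, keyed1, joined, result) are illustrative and not part of the original answer:

# Illustrative breakdown of the one-liner above; variable names are hypothetical.
keyed2 = rdd1.map(lambda x: (x[1], x[0]))         # [('2', '1'), ('3', '1')]
keyed1 = rdd.map(lambda x: (x[0], (x[1], x[2])))  # [('2', ('100', 2)), ('1', ('300', 1)), ('1', ('200', 1))]
joined = keyed2.join(keyed1)                      # inner join keeps only the shared key '2': [('2', ('1', ('100', 2)))]
result = joined.map(lambda x: (x[1][0], x[0], x[1][1][0], x[1][1][1]))
result.collect()                                  # [('1', '2', '100', 2)]

Note that coalesce(1) in the answer only reduces the number of partitions before collect(); it does not change the result.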

