PySpark split rows and convert to RDD


Question

I have an RDD in which each element has the following format:

['979500797', ' 979500797,260973244733,2014-05-0402:05:12,645/01/105/9931,78,645/01/105/9931,a1,forward;979500797,260972593713,2014-05-0407:05:04,645/01/105/9931,22,645/01/105/863,a4,forward']

I want to transform it into another RDD in which the key stays the same, i.e. 979500797, but the value is the result of splitting on ';'. In other words, the final output should be

[
   ['979500797', ' 979500797,260973244733,2014-05-0402:05:12,645/01/105/9931,78,645/01/105/9931,a1,forward'],
   ['979500797', '979500797,260972593713,2014-05-0407:05:04,645/01/105/9931,22,645/01/105/863,a4,forward']
]

I have been trying to use map like this:

df_feat3 = df_feat2.map(lambda (x, y): (x, y.split(';')))

but it doesn't seem to work.

Answer

What you need here is flatMap. flatMap takes a function that returns a sequence and concatenates the results. (A plain map keeps a one-to-one mapping, so it would produce a single pair whose value is a list of sub-records, rather than one row per sub-record.)

df_feat3 = df_feat2.flatMap(lambda (x, y): ((x, v) for v in y.split(';')))

On a side note, I would avoid using tuple parameters. It is a cool feature, but it is no longer available in Python 3. See PEP 3113.
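
For reference, here is a minimal self-contained sketch of the same approach that also runs under Python 3, where tuple parameters are gone, so the pair is unpacked by index instead. The names df_feat2/df_feat3 match the question; the local SparkContext named sc is an assumption for the demo:

from pyspark import SparkContext

sc = SparkContext("local", "split-rows")  # assumed local context for the demo

# The sample record from the question: one key paired with two
# ';'-separated sub-records packed into a single string.
df_feat2 = sc.parallelize([
    ('979500797',
     ' 979500797,260973244733,2014-05-0402:05:12,645/01/105/9931,78,'
     '645/01/105/9931,a1,forward;979500797,260972593713,2014-05-0407:05:04,'
     '645/01/105/9931,22,645/01/105/863,a4,forward')
])

# No tuple parameters in Python 3 (PEP 3113), so index into the pair.
df_feat3 = df_feat2.flatMap(lambda kv: ((kv[0], v) for v in kv[1].split(';')))

for pair in df_feat3.collect():
    print(pair)
# ('979500797', ' 979500797,260973244733,...,a1,forward')
# ('979500797', '979500797,260972593713,...,a4,forward')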

