How to group by one column in an RDD in PySpark?
Problem description
The RDD in PySpark consists of lists with four elements each:
[id1, 'aaa',12,87]
[id2, 'acx',1,90]
[id3, 'bbb',77,10]
[id2, 'bbb',77,10]
.....
I want to group by the ids in the first column and collect the other three columns for each id, for example => [id2,[['acx',1,90], ['bbb',77,10]...]]
How can I achieve this?
Recommended answer
spark.version
# u'2.2.0'
rdd = sc.parallelize((['id1', 'aaa',12,87],
['id2', 'acx',1,90],
['id3', 'bbb',77,10],
['id2', 'bbb',77,10]))
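# Key each row by its first column, group the remaining columns per key,
# then turn each grouped iterable into a plain list
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).collect()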
# result:
[('id2', [['acx', 1, 90], ['bbb', 77, 10]]),
('id3', [['bbb', 77, 10]]),
('id1', [['aaa', 12, 87]])]
Or, if you strictly prefer lists over tuples, you can add one more map operation after mapValues:
rdd.map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list).map(lambda x: list(x)).collect()
# result:
[['id2', [['acx', 1, 90], ['bbb', 77, 10]]],
['id3', [['bbb', 77, 10]]],
['id1', [['aaa', 12, 87]]]]
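To see what the map / groupByKey / mapValues(list) chain is doing without starting Spark, the same grouping can be sketched in plain Python. This is only an illustration of the semantics, not how Spark executes it; the helper name group_by_first_column is hypothetical and is not part of PySpark.

```python
from collections import defaultdict


def group_by_first_column(rows):
    """Group rows by their first element, keeping the remaining columns.

    Hypothetical helper that mimics, on a local list, what
    map(lambda x: (x[0], x[1:])).groupByKey().mapValues(list) does on an RDD.
    """
    groups = defaultdict(list)
    for row in rows:
        key, rest = row[0], list(row[1:])  # split key from the other columns
        groups[key].append(rest)           # collect the values under their key
    # Return nested lists, matching the shape of the second result above
    return [[key, values] for key, values in groups.items()]


rows = [
    ['id1', 'aaa', 12, 87],
    ['id2', 'acx', 1, 90],
    ['id3', 'bbb', 77, 10],
    ['id2', 'bbb', 77, 10],
]

grouped = group_by_first_column(rows)
print(grouped)
```

In this local sketch the keys come out in first-seen order (id1, id2, id3). With a real RDD the order of the collected groups depends on how the data is partitioned and shuffled, so it should not be relied upon.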