PySpark: Suggestion on how to organize RDD
Question
I'm a Spark noobie, and I'm trying to test something out on Spark to see whether there are any performance boosts for the size of data I'm using.
Each object in my RDD contains a time, an id, and a position.
I want to compare the positions of groups that share the same time and the same id. So I would first run the following to get the records grouped by id:
grouped_rdd = rdd.map(lambda x: (x.id, [x])).groupByKey()
I would then like to break each of these groups up by the time of each object.
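For example, something like the following rough sketch is the shape I have in mind (this assumes each record exposes .id, .time, and .pos attributes, which is just my reading of the data; the itertools.groupby step is only one hypothetical way to split each id group by time):

from itertools import groupby

# Group records by id, then split each id group by time, keeping positions.
# A sketch of the intent only, not necessarily an efficient approach.
by_id = rdd.map(lambda x: (x.id, x)).groupByKey()
by_id_then_time = by_id.mapValues(
    lambda records: {t: [r.pos for r in grp]
                     for t, grp in groupby(sorted(records, key=lambda r: r.time),
                                           key=lambda r: r.time)})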
Any suggestions? Thanks!
Answer
First of all, if you want both id and time as the key, just put them both into the key, rather than grouping by id first and then splitting by time separately.
m = sc.parallelize([(1,2,3),(1,2,4),(2,3,5)])
n = m.map(lambda x: ((x[0], x[1]), x[2]))
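For instance, grouping on that composite key then yields one group per (id, time) pair in a single pass (a quick check against the m and n defined above; groupByKey is used here purely for illustration, and the collect() output order may vary):

# One group per (id, time) key -- for illustration only, see the advice below.
print(n.groupByKey().mapValues(list).collect())
# e.g. [((1, 2), [3, 4]), ((2, 3), [5])]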
Secondly, avoid groupByKey, which performs badly, and use combineByKey or reduceByKey if possible (see http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html).
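As a minimal sketch of that advice, here is what the same aggregation could look like with the n defined above (the choice of summing positions and of collecting them into lists is purely illustrative):

# reduceByKey combines values pairwise on each partition before shuffling,
# so far less data moves across the network than with groupByKey.
position_sums = n.reduceByKey(lambda a, b: a + b)

# combineByKey gives explicit control over how values are merged per key;
# here it collects the positions for each (id, time) key into a list.
positions = n.combineByKey(
    lambda v: [v],             # createCombiner: start a list from the first value
    lambda acc, v: acc + [v],  # mergeValue: fold in a value within a partition
    lambda a, b: a + b)        # mergeCombiners: merge partial lists across partitions

print(position_sums.collect())  # e.g. [((1, 2), 7), ((2, 3), 5)]
print(positions.collect())      # e.g. [((1, 2), [3, 4]), ((2, 3), [5])]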