Spark dataframe reduceByKey
Question
I am using Spark 1.5/1.6, and I want to do a reduceByKey operation on a DataFrame; I don't want to convert the df to an RDD.
Each row looks like the following, and I have multiple rows for id1:
id1, id2, score, time
I want something like:
id1, [ (id21, score21, time21) , (id22, score22, time22) , (id23, score23, time23) ]
So, for each "id1", I want all records in a list.
By the way, the reason I don't want to convert the df to an RDD is that I have to join this (reduced) DataFrame to another DataFrame, and I am re-partitioning on the join key, which makes the join faster; I guess the same cannot be done with an RDD.
Any help would be appreciated.
Answer
To simply preserve the partitioning already achieved, re-use the parent RDD's partitioner in the reduceByKey invocation:
val rdd = df.rdd                        // df.rdd returns an RDD[Row]; it must be
                                        // keyed into a pair RDD before reduceByKey
val parentRdd = rdd.dependencies(0).rdd // Assuming the first parent has the
                                        // desired partitioning: adjust as needed
val parentPartitioner = parentRdd.partitioner.get // partitioner is an Option
                                                  // and may be None
val optimizedReducedRdd = rdd.reduceByKey(parentPartitioner, reduceFn)
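The answer leaves reduceFn undefined, and df.rdd yields an RDD[Row] rather than a pair RDD. A minimal sketch of the missing pieces, assuming the column order is (id1, id2, score, time) with string ids, a double score, and a long timestamp (these types and positions are assumptions, not from the original answer):

```scala
// Hypothetical helper: key each row by id1 and wrap the remaining fields
// in a one-element list, so reduceByKey can concatenate lists per key
val pairs = df.rdd.map { row =>
  (row.getString(0), Seq((row.getString(1), row.getDouble(2), row.getLong(3))))
}

// Hypothetical reduce function: merge the per-key record lists
val reduceFn =
  (a: Seq[(String, Double, Long)], b: Seq[(String, Double, Long)]) => a ++ b

val reduced = pairs.reduceByKey(reduceFn)
```

After this, each key id1 maps to one Seq of (id2, score, time) tuples, which matches the shape the question asks for.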
If you do not specify the partitioner, as follows:
df.rdd.reduceByKey(reduceFn) // This is non-optimized: uses a full shuffle
then the behavior you noted would occur, i.e. a full shuffle. That is because the default HashPartitioner would be used instead.
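For the asker's actual goal, collecting all records per id1 without leaving the DataFrame API, a hedged alternative is groupBy with collect_list. Note that in Spark 1.5/1.6 collect_list is backed by a Hive UDAF and typically requires a HiveContext (it became a native function later):

```scala
import org.apache.spark.sql.functions.{collect_list, struct}

// Group all (id2, score, time) records into a single array column per id1.
// Staying in the DataFrame API means the join-key repartitioning mentioned
// in the question still applies to the result.
val grouped = df
  .groupBy("id1")
  .agg(collect_list(struct("id2", "score", "time")).as("records"))
```

This produces one row per id1 with an array-of-structs column, the DataFrame equivalent of the list the question describes.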