Doing reduceByKey on each partition of RDD separately without aggregating results


Problem description

I have an RDD partitioned across the cluster and I want to do reduceByKey on each partition separately. I don't want the per-partition results of reduceByKey to be merged together; that is, I want to prevent Spark from shuffling the intermediate results of reduceByKey across the cluster.

The code below does not work, but I want something like this:

myPairedRDD.mapPartitions({iter => iter.reduceByKey((x, y) => x + y)})

How can I achieve this?

Recommended answer

You could try something like:

myPairedRDD.mapPartitions(iter =>
  // Scala's Iterator has no groupBy, so materialize the partition first
  iter.toSeq.groupBy(_._1).mapValues(_.map(_._2).reduce(_ + _)).iterator
)
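To see what this does per partition, here is a plain-Scala sketch (no Spark required) that applies the same groupBy-and-reduce logic to two hypothetical partitions, each represented as a `Seq` of key/value pairs; the sample data is made up for illustration:

```scala
// Each inner Seq stands in for one RDD partition (sample data, for illustration).
val partitions = Seq(
  Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)),
  Seq(("a", 10.0), ("c", 4.0))
)

// Same logic as the mapPartitions body above: reduce by key within each partition only.
val perPartition = partitions.map { part =>
  part.groupBy(_._1).mapValues(_.map(_._2).reduce(_ + _)).toMap
}

// Keys repeated across partitions stay separate: "a" sums to 4.0 in the first
// partition and stays 10.0 in the second, with no cross-partition merge.
```

Note that because no shuffle happens, the same key can appear in the output of several partitions, which is exactly the asker's intent.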

or, to keep things more memory efficient (here I assume that myPairedRDD is RDD[(String, Double)]; please adjust the types to match your use case):

// requires: import scala.collection.mutable
myPairedRDD.mapPartitions(iter =>
  iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
    case (acc, (k, v)) => { acc(k) += v; acc }
  }.iterator
)
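The foldLeft version can likewise be sketched without Spark; here one hypothetical partition is consumed as an iterator, mirroring what mapPartitions passes in (sample data again made up):

```scala
import scala.collection.mutable

// One partition's data, consumed as an iterator, just as inside mapPartitions.
val iter = Iterator(("a", 1.0), ("b", 2.0), ("a", 3.0))

// Accumulate sums in a single mutable map instead of building intermediate groups;
// withDefaultValue(0.0) lets acc(k) += v work for keys not yet seen.
val result = iter.foldLeft(mutable.Map[String, Double]().withDefaultValue(0.0)) {
  case (acc, (k, v)) => { acc(k) += v; acc }
}
```

The single pass over the iterator is what makes this variant more memory efficient than groupBy: it never holds all values for a key at once, only the running sums.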

But please note that, unlike shuffling operations, this cannot offload (spill) data from memory to disk, so each partition's accumulated map has to fit in memory.

