How to compute cumulative sum using Spark
Problem description
I have an RDD of (String, Int) pairs which is sorted by key:
val data = Array(("c1", 6), ("c2", 3), ("c3", 4))
val rdd = sc.parallelize(data).sortByKey()
Now I want the value for the first key to start at zero, and each subsequent key to carry the sum of all previous keys' values.
E.g.: c1 = 0, c2 = c1's value, c3 = (c1's value + c2's value), c4 = (c1's + ... + c3's value). Expected output:
(c1,0), (c2,6), (c3,9)...
Is it possible to achieve this? I tried it with map, but the running sum is not preserved across the map.
var sum = 0
val t = keycount.map { x =>
  val temp = sum
  sum = sum + x._2  // mutates a driver-side copy; each executor sees its own copy of sum
  (x._1, temp)
}
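For reference, on a plain Scala collection the desired exclusive prefix sum is just a scanLeft; the difficulty in Spark is that the data is split across partitions. A minimal non-distributed sketch of what the question asks for:

```scala
// Local (non-distributed) equivalent of the desired result:
val data = Seq(("c1", 6), ("c2", 3), ("c3", 4))
val (keys, values) = data.unzip
// scanLeft(0) produces Seq(0, 6, 9, 13); dropping the last element
// gives the exclusive prefix sums Seq(0, 6, 9)
val prefix = values.scanLeft(0)(_ + _).init
val local = keys.zip(prefix)  // Seq((c1,0), (c2,6), (c3,9))
```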
Recommended answer
Compute partial results for each partition:
val partials = rdd.mapPartitionsWithIndex((i, iter) => {
  val (keys, values) = iter.toSeq.unzip
  val sums = values.scanLeft(0)(_ + _)
  // sums.tail yields inclusive sums; use sums.init instead to match the
  // exclusive sums in the question's expected output ((c1,0), (c2,6), ...)
  Iterator((keys.zip(sums.tail), sums.last))
})
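To see what this step produces, here is a pure-Scala trace of one partition's work, assuming the example data is split into two partitions, (("c1",6), ("c2",3)) and (("c3",4)):

```scala
// Trace of the first assumed partition, holding ("c1",6) and ("c2",3):
val part = Seq(("c1", 6), ("c2", 3))
val (keys, values) = part.unzip
val sums = values.scanLeft(0)(_ + _)       // Seq(0, 6, 9)
val withinPart = keys.zip(sums.tail)       // Seq((c1,6), (c2,9)) – sums local to this partition
val partTotal = sums.last                  // 9 – later becomes the offset for the next partition
```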
Collect the partial sums:
val partialSums = partials.values.collect
Compute the cumulative sum over the partitions and broadcast it:
val sumMap = sc.broadcast(
  (0 until rdd.partitions.size)
    .zip(partialSums.scanLeft(0)(_ + _))
    .toMap
)
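Concretely, assuming the example data is split into two partitions with totals 9 and 4, the exclusive scan of those totals gives each partition the sum of everything before it:

```scala
// Assumed per-partition totals for the example data split as
// (("c1",6), ("c2",3)) and (("c3",4)):
val partialSums = Seq(9, 4)
// scanLeft produces Seq(0, 9, 13); zipping with the partition
// indices (0, 1) drops the trailing grand total
val offsets = (0 until 2).zip(partialSums.scanLeft(0)(_ + _)).toMap
// Map(0 -> 0, 1 -> 9)
```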
Compute the final results:
val result = partials.keys.mapPartitionsWithIndex((i, iter) => {
  val offset = sumMap.value(i)
  if (iter.isEmpty) Iterator.empty
  else iter.next().map { case (k, v) => (k, v + offset) }.toIterator
})
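The whole pipeline can be simulated end-to-end in plain Scala (an illustrative sketch, assuming the same hypothetical two-partition layout as above):

```scala
// Plain-Scala simulation of the four steps, with the example data
// assumed to land in two partitions:
val partitions = Seq(Seq(("c1", 6), ("c2", 3)), Seq(("c3", 4)))

// Step 1: per-partition inclusive scans plus the partition total
val partials = partitions.map { part =>
  val (keys, values) = part.unzip
  val sums = values.scanLeft(0)(_ + _)
  (keys.zip(sums.tail), sums.last)
}

// Steps 2-3: exclusive scan of the totals gives each partition's offset
val offsets = partials.map(_._2).scanLeft(0)(_ + _)

// Step 4: shift every pair by its partition's offset
val result = partials.zip(offsets).flatMap { case ((pairs, _), off) =>
  pairs.map { case (k, v) => (k, v + off) }
}
// result: Seq((c1,6), (c2,9), (c3,13)) – inclusive cumulative sums; use
// sums.init instead of sums.tail in step 1 to reproduce the question's
// exclusive output (c1,0), (c2,6), (c3,9)
```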