Who can give a clear explanation for `combineByKey` in Spark?

Problem Description

I am learning Spark, but I can't understand the function combineByKey.

>>> data = sc.parallelize([("A",1),("A",2),("B",1),("B",2),("C",1)] )
>>> data.combineByKey(lambda v : str(v)+"_", lambda c, v : c+"@"+str(v), lambda c1, c2 : c1+c2).collect()

The output is:

[('A', '1_2_'), ('C', '1_'), ('B', '1_2_')]

First, I am very confused: where is the @ from the second step, lambda c, v : c+"@"+str(v)? I can't find any @ in the result.

Second, I read the function description for combineByKey, but I am still confused by the algorithm flow.

Recommended Answer

The groupByKey call makes no attempt at merging/combining values, so it's an expensive operation.

The combineByKey call is exactly such an optimization. When using combineByKey, values are merged into one value per key within each partition, and then each partition's value is merged into a single value per key. It's worth noting that the type of the combined value does not have to match the type of the original value, and often it won't. The combineByKey function takes 3 functions as arguments:

  1. A function that creates a combiner. In the aggregateByKey function the first argument was simply an initial zero value. In combineByKey we provide a function that accepts our current value as a parameter and returns the new value that will be merged with additional values.

  2. The second function is a merging function that takes a value and merges/combines it into the previously collected values for that key within the same partition.

  3. The third function combines the merged values together. Basically this function takes the new values produced at the partition level and combines them until we end up with a single value per key (see the sketch after this list).
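
To make the three roles concrete, here is a minimal PySpark sketch that computes a per-key average with combineByKey. It assumes a live SparkContext named sc, as in the question; the function names createCombiner, mergeValue and mergeCombiners are chosen purely for illustration:

def createCombiner(value):
    # first time a key is seen within a partition: start a (sum, count) accumulator
    return (value, 1)

def mergeValue(acc, value):
    # same key seen again within the same partition: fold the new value into the accumulator
    return (acc[0] + value, acc[1] + 1)

def mergeCombiners(acc1, acc2):
    # combine accumulators for the same key coming from different partitions
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])

data = sc.parallelize([("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)])
sums_and_counts = data.combineByKey(createCombiner, mergeValue, mergeCombiners)
averages = sums_and_counts.mapValues(lambda s_c: float(s_c[0]) / s_c[1])
print(averages.collect())   # e.g. [('A', 1.5), ('B', 1.5), ('C', 1.0)] (key order may vary)

Note how the accumulator type (a (sum, count) tuple) differs from the value type (an int), which is exactly the flexibility described above.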

In other words, to understand combineByKey, it's useful to think of how it handles each element it processes. As combineByKey goes through the elements in a partition, each element either has a key it hasn't seen before or has the same key as a previous element.

If it's a new element, combineByKey uses a function we provide, called createCombiner(), to create the initial value for the accumulator on that key. It's important to note that this happens the first time a key is found in each partition, rather than only the first time the key is found in the RDD.

If it is a value we have seen before while processing that partition, it will instead use the provided function, mergeValue(), with the current value of the accumulator for that key and the new value.

Since each partition is processed independently, we can have multiple accumulators for the same key. When we are merging the results from each partition, if two or more partitions have an accumulator for the same key, we merge the accumulators using the user-supplied mergeCombiners() function.
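
This per-partition behavior also explains the first question above: in the posted run the two values for "A" (and for "B") apparently landed in different partitions, so mergeValue (the lambda containing "@") was never invoked; only createCombiner and mergeCombiners ran, producing '1_2_'. As a sketch (again assuming the SparkContext sc from the question), forcing all the data into a single partition makes the @ appear, because the second value of each key is then folded in via mergeValue:

data = sc.parallelize([("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)], 1)  # one partition
data.combineByKey(lambda v: str(v) + "_",            # createCombiner
                  lambda c, v: c + "@" + str(v),     # mergeValue
                  lambda c1, c2: c1 + c2).collect()  # mergeCombiners (never needed with one partition)
# e.g. [('A', '1_@2'), ('B', '1_@2'), ('C', '1_')]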

References:

  • Learning Spark, Chapter 4.
  • Using combineByKey in Apache-Spark (blog entry).
