`combineByKey`, pyspark


Problem Description

Just wondering what exactly this does? I understand keyBy, but I struggle to understand what exactly combineByKey is doing. I have read through the pages (link) and still don't understand.

df.rdd.keyBy(
        lambda row: row['id']
    ).combineByKey(
        lambda row: [row],
        lambda rows, row: rows + [row],
        lambda rows1, rows2: rows1 + rows2,
    )

Recommended Answer

In short, combineByKey lets you explicitly specify the 3 stages of aggregating (or reducing) your rdd.

1. What to do with a single row the first time it is encountered?

In the example you provided, the row is put in a list.

2. What to do when a single row meets a previously reduced row?

In the example, the previously reduced rows already form a list, so we append the new row to it and return the new, extended list.

3. What to do with two previously reduced rows?

In the example above, both are already lists, and we return a new list containing the items from both of them (a runnable sketch of these three stages follows below).
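For reference, here is a minimal, self-contained sketch of that pattern, assuming a local SparkSession and a toy DataFrame with an id column (the data and names here are illustrative, not from the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("combineByKey-demo").getOrCreate()

# Toy DataFrame; only the 'id' column is needed by the keyBy lambda
df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "value"])

grouped = df.rdd.keyBy(
    lambda row: row['id']
).combineByKey(
    lambda row: [row],                   # createCombiner: first row seen for a key starts a new list
    lambda rows, row: rows + [row],      # mergeValue: append a later row from the same partition
    lambda rows1, rows2: rows1 + rows2,  # mergeCombiners: concatenate lists built on different partitions
)

print(grouped.collect())  # e.g. [(1, [Row(id=1, value='a'), Row(id=1, value='b')]), (2, [Row(id=2, value='c')])]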

A more detailed, well-explained, step-by-step walkthrough is available at the link below:

http://etlcode.com/index.php/blog/info/Bigdata/Apache-Spark-Difference-between-reduceByKey-groupByKey-and-combineByKey

A key explanation from the linked article is:

Let us see how combineByKey works in our use case. As combineByKey navigates through each element, i.e. for partition 1 - (Messi,45), it has a key it has not seen before, and when it moves to the next element (Messi,48) it gets a key it has seen before. When it sees an element for the first time, combineByKey() uses a function called createCombiner to create an initial value for the accumulator on that key, i.e. it uses Messi as the key and 45 as the value. So the current value of the accumulator for that key (Messi) becomes 45. The next time combineByKey() sees the same key on the same partition, it does not use createCombiner; instead it uses the second function, mergeValue, with the current value of the accumulator (45) and the new value 48.

Since all this happens in parallel across different partitions, there is a chance that the same key exists on other partitions with its own set of accumulators. So when the results from different partitions have to be merged, the mergeCombiners function is used.
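As an illustration of those three functions, here is a small sketch in the spirit of the quoted Messi example; the (sum, count) accumulator and the final averaging step are an illustrative choice, not necessarily the exact code from the linked article:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# (player, goals) pairs spread over 2 partitions, so mergeCombiners actually runs
scores = sc.parallelize([("Messi", 45), ("Messi", 48), ("Ronaldo", 49)], 2)

sum_count = scores.combineByKey(
    lambda v: (v, 1),                         # createCombiner: first value for a key in a partition
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # mergeValue: fold another value from the same partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # mergeCombiners: merge accumulators from different partitions
)

averages = sum_count.mapValues(lambda t: t[0] / t[1])
print(averages.collect())  # e.g. [('Messi', 46.5), ('Ronaldo', 49.0)]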

