How can I use reduceByKey instead of GroupByKey to construct a list?

Views: 377 Posted: 2016/5/22 15:27:53 Tags: python apache-spark pyspark rdd

Question

My RDD is made of many items, each of which is a tuple as follows:

(key1, (val1_key1, val2_key1))
(key2, (val1_key2, val2_key2))
(key1, (val1_again_key1, val2_again_key1))
... and so on

I used groupByKey on the RDD, which gave the following result:

(key1, [(val1_key1, val2_key1), (val1_again_key1, val2_again_key1), (), ... ()])
(key2, [(val1_key2, val2_key2), (), ... ()])
... and so on

I need to do the same using reduceByKey. I tried:

RDD.reduceByKey(lambda val1, val2: list(val1).append(val2))

but it does not work.

Please suggest the right way to implement this using reduceByKey().

Recommended answer

The answer is that you cannot (or at least not in a straightforward and Pythonic way without abusing language dynamism). Since the value type and the return type are different (a single tuple vs. a list of tuples), reduce is not a valid function here. You could use combineByKey or aggregateByKey, for example like this:

rdd = sc.parallelize([
    ("key1", ("val1_key1", "val2_key1")),
    ("key2", ("val1_key2", "val2_key2"))])

rdd.aggregateByKey([], lambda acc, x: acc + [x], lambda acc1, acc2: acc1 + acc2)

but it is just a less efficient version of groupByKey. See also: Is groupByKey ever preferred over reduceByKey
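
To make the type mismatch concrete, here is a minimal local sketch in plain Python, with `functools.reduce` standing in for the per-key reduction that reduceByKey performs. The sample values (including the third pair) and the names `attempt`, `wrapped` and `merged` are assumptions for illustration, not part of the original question. It shows why the attempted lambda returns nothing useful, and that a reduce only becomes valid once every value is already a list, which in PySpark would presumably look like `rdd.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b)`.

```python
from functools import reduce

# Hypothetical values that all belong to one key, e.g. "key1".
values = [
    ("val1_key1", "val2_key1"),
    ("val1_again_key1", "val2_again_key1"),
    ("val1_third_key1", "val2_third_key1"),
]

# The function from the question. list.append mutates the temporary list
# in place and returns None, so the result of the lambda is always None.
attempt = lambda val1, val2: list(val1).append(val2)
first = attempt(values[0], values[1])

# Workaround: wrap each value in a one-element list first, so that the
# inputs and the output of the reduce function share one type (list).
# This mirrors mapValues(lambda v: [v]) followed by reduceByKey.
wrapped = [[v] for v in values]
merged = reduce(lambda a, b: a + b, wrapped)

print(first)
print(merged)
```

Like the aggregateByKey version, this still ships every value across the shuffle and builds a new list on each merge, so it should not be expected to outperform groupByKey.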
