Reduce a key-value pair into a key-list pair with Apache Spark
Question
I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the ReduceByKey function with something of the flavor:
My_KMV = My_KV.reduce(lambda a, b: a.append([b]))
The error that I get when this occurs is:
'NoneType' object has no attribute 'append'.
My keys are integers and values V1,...,Vn are tuples. My goal is to create a single pair with the key and a list of the values (tuples).
Answer
Map and ReduceByKey
The input type and output type of reduce must be the same; therefore, if you want to aggregate into a list, you have to map the input to lists. Afterwards you combine the lists into one list.
Combining lists

You'll need a method that combines lists into one list. Python provides several ways to combine lists.
append modifies the first list and will always return None:
x = [1, 2, 3]
x.append([4, 5])
# x is [1, 2, 3, [4, 5]]
extend works similarly, but unwraps the list:
x = [1, 2, 3]
x.extend([4, 5])
# x is [1, 2, 3, 4, 5]
Both methods return None, but you need a method that returns the combined list; therefore, just use the plus sign.
x = [1, 2, 3] + [4, 5]
# x is [1, 2, 3, 4, 5]
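This distinction also explains the error from the question. A minimal sketch with functools.reduce (plain Python, no Spark required; the sample lists are made up):

```python
from functools import reduce

# append mutates in place and returns None, so the second reduction
# step receives None as its accumulator -- exactly the reported error.
try:
    reduce(lambda a, b: a.append(b), [[1], [2], [3]])
except AttributeError as e:
    error_message = str(e)  # 'NoneType' object has no attribute 'append'

# The plus sign returns a new combined list at every step instead.
combined = reduce(lambda a, b: a + b, [[1], [2], [3]])
print(combined)  # [1, 2, 3]
```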
Spark
file = spark.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda actor: (actor.split(",")[0], actor))
              # transform each value into a list
              .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
              # combine lists: ([1,2,3] + [4,5]) becomes [1,2,3,4,5]
              .reduceByKey(lambda a, b: a + b))
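Spark isn't required to see the pattern. A minimal local sketch of the same map-to-list and reduce-by-key steps in plain Python (the sample input lines are made up):

```python
from collections import defaultdict

lines = ["1,Alice 2,Bob", "1,Carol"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split(" ")]

# map: (key, [value]) -- wrap each value in a single-element list
pairs = [(w.split(",")[0], [w]) for w in words]

# reduceByKey: combine the lists per key with +
result = defaultdict(list)
for k, v in pairs:
    result[k] = result[k] + v

print(dict(result))  # {'1': ['1,Alice', '1,Carol'], '2': ['2,Bob']}
```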
CombineByKey
It's also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it's more complex, and "using one of the specialized per-key combiners in Spark can be much faster" (Learning Spark, ch. 4). Your use case is simple enough for the solution above.
GroupByKey
It's also possible to solve this with groupByKey, but it reduces parallelization and therefore could be much slower for big data sets.