Reduce a key-value pair into a key-list pair with Apache Spark

Question

I am writing a Spark application and want to combine a set of key-value pairs (K, V1), (K, V2), ..., (K, Vn) into one key-multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something of the flavor:

My_KMV = My_KV.reduce(lambda a, b: a.append([b]))

The error that I get when this runs is:

'NoneType' object has no attribute 'append'.

My keys are integers and the values V1, ..., Vn are tuples. My goal is to create a single pair with the key and a list of the values (tuples).

Recommended answer

Map and ReduceByKey

The input type and output type of reduce must be the same. Therefore, if you want to aggregate values into a list, you first have to map each input value to a list. Afterwards you combine the lists into one list.
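The two steps can be tried locally without a cluster. This is only a sketch in plain Python: `reduce_by_key` is a hypothetical stand-in written for illustration, not Spark's reduceByKey, and the sample pairs are made up to match the question (integer keys, tuple values).

```python
def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for RDD.reduceByKey (not Spark code)."""
    result = {}
    for key, value in pairs:
        # The first value for a key is stored as-is; later ones are merged with func.
        result[key] = func(result[key], value) if key in result else value
    return result

# Keys are integers and values are tuples, as in the question.
pairs = [(1, ("a", 10)), (2, ("b", 20)), (1, ("c", 30))]

# Step 1: wrap each value in a one-element list so input and output types match.
as_lists = [(key, [value]) for key, value in pairs]

# Step 2: combine the lists per key with +, which returns a new list.
grouped = reduce_by_key(as_lists, lambda a, b: a + b)
print(grouped)
```

Because every value is already a list before the reduce step, the reduce function only ever sees lists and only ever returns lists.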

Combine Lists

You'll need a way to combine lists into one list. Python provides several ways to do this.

append modifies the first list in place and always returns None.

x = [1, 2, 3]
x.append([4, 5])
# x is [1, 2, 3, [4, 5]]

extend does the same, but unwraps the list it is given:

x = [1, 2, 3]
x.extend([4, 5])
# x is [1, 2, 3, 4, 5]

Both methods return None, but you need an operation that returns the combined list, so just use the plus sign.

x = [1, 2, 3] + [4, 5]
# x is [1, 2, 3, 4, 5]
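This is also why the code in the question fails. The sketch below uses Python's built-in functools.reduce rather than Spark, on the assumption that the behaviour of the lambda is the same in both: the result of one step is passed as the first argument of the next step.

```python
from functools import reduce

# append returns None, so the second step effectively calls None.append(...)
try:
    reduce(lambda a, b: a.append(b), [[1], [2], [3]])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# + returns the combined list, so every step receives a list again
combined = reduce(lambda a, b: a + b, [[1], [2], [3]])

print(error_message)
print(combined)
```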

Spark

# `spark` is assumed to be a SparkContext here (often named `sc`)
file = spark.textFile("hdfs://...")
counts = (
    file.flatMap(lambda line: line.split(" "))
    .map(lambda actor: (actor.split(",")[0], actor))
    # transform each value into a list
    .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
    # combine lists: ([1,2,3] + [4,5]) becomes [1,2,3,4,5]
    .reduceByKey(lambda a, b: a + b)
)


CombineByKey

It's also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it's more complex, and "using one of the specialized per-key combiners in Spark can be much faster" (https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html). Your use case is simple enough for the solution above.
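For comparison, here is a sketch of the three functions combineByKey expects. The names `to_list`, `append_value`, and `extend_lists` are illustrative, and `combine_by_key_local` is a hypothetical local emulation over hand-made partitions, not Spark's implementation. With a real RDD the call would look like `rdd.combineByKey(to_list, append_value, extend_lists)`.

```python
def to_list(value):
    # createCombiner: start a list from the first value seen for a key
    return [value]

def append_value(acc, value):
    # mergeValue: add another value to the list within one partition
    acc.append(value)
    return acc  # return the list itself, not the None that append returns

def extend_lists(acc1, acc2):
    # mergeCombiners: merge the lists built on different partitions
    acc1.extend(acc2)
    return acc1

def combine_by_key_local(partitions, create, merge_value, merge_combiners):
    """Hypothetical local emulation of the combineByKey data flow."""
    merged = {}
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = merge_value(local[key], value) if key in local else create(value)
        for key, acc in local.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged

# Two made-up partitions holding (key, value) pairs
partitions = [[(1, "a"), (1, "b"), (2, "c")], [(1, "d"), (2, "e")]]
result = combine_by_key_local(partitions, to_list, append_value, extend_lists)
print(result)
```

Unlike the lambda in the question, `append_value` and `extend_lists` may mutate their first argument because they explicitly return it afterwards.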

GroupByKey

It's also possible to solve this with groupByKey, but it reduces parallelization and can therefore be much slower on large data sets (http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html).
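With a real RDD this approach would typically read `rdd.groupByKey().mapValues(list)`, since groupByKey yields an iterable per key rather than a list. The snippet below is a plain-Python sketch of the resulting shape only; `group_by_key_local` is a hypothetical helper and says nothing about how Spark shuffles the data.

```python
from collections import defaultdict

def group_by_key_local(pairs):
    """Hypothetical local stand-in for groupByKey followed by mapValues(list)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# Integer keys and tuple values, as in the question
pairs = [(1, ("a", 10)), (2, ("b", 20)), (1, ("c", 30))]
grouped = group_by_key_local(pairs)
print(grouped)
```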
