Reduce a key-value pair into a key-list pair with Apache Spark
Question
I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the ReduceByKey function with something of the flavor:
My_KMV = My_KV.reduce(lambda a, b: a.append([b]))
The error that I get when this occurs is:
'NoneType' object has no attribute 'append'.
My keys are integers and values V1,...,Vn are tuples. My goal is to create a single pair with the key and a list of the values (tuples).
Answer
Map and ReduceByKey
The input type and output type of reduce must be the same; therefore, if you want to aggregate into a list, you have to map the input to lists. Afterwards you combine the lists into one list.
Combining lists

You'll need a method that combines lists into one list. Python provides several ways to combine lists.
append modifies the first list and will always return None:
x = [1, 2, 3]
x.append([4, 5])
# x is [1, 2, 3, [4, 5]]
extend works similarly, but unwraps the list:
x = [1, 2, 3]
x.extend([4, 5])
# x is [1, 2, 3, 4, 5]
Both methods return None, but you need a method that returns the combined list; therefore, just use the plus sign.
x = [1, 2, 3] + [4, 5]
# x is [1, 2, 3, 4, 5]
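This distinction also explains the error from the question. A minimal sketch with functools.reduce (plain Python, no Spark required; the sample lists are made up):

```python
from functools import reduce

# append mutates in place and returns None, so the second reduction
# step receives None as its accumulator -- exactly the reported error.
try:
    reduce(lambda a, b: a.append(b), [[1], [2], [3]])
except AttributeError as e:
    error_message = str(e)  # 'NoneType' object has no attribute 'append'

# The plus sign returns a new combined list at every step instead.
combined = reduce(lambda a, b: a + b, [[1], [2], [3]])
print(combined)  # [1, 2, 3]
```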
Spark
file = spark.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda actor: (actor.split(",")[0], actor))
              # transform each value into a list
              .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
              # combine lists: ([1,2,3] + [4,5]) becomes [1,2,3,4,5]
              .reduceByKey(lambda a, b: a + b))
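Spark isn't required to see the pattern. A minimal local sketch of the same map-to-list and reduce-by-key steps in plain Python (the sample input lines are made up):

```python
from collections import defaultdict

lines = ["1,Alice 2,Bob", "1,Carol"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split(" ")]

# map: (key, [value]) -- wrap each value in a single-element list
pairs = [(w.split(",")[0], [w]) for w in words]

# reduceByKey: combine the lists per key with +
result = defaultdict(list)
for k, v in pairs:
    result[k] = result[k] + v

print(dict(result))  # {'1': ['1,Alice', '1,Carol'], '2': ['2,Bob']}
```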
CombineByKey
It's also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it's more complex, and "using one of the specialized per-key combiners in Spark can be much faster" (Learning Spark, ch. 4). Your use case is simple enough for the solution above.
GroupByKey
It's also possible to solve this with groupByKey, but it reduces parallelization and therefore could be much slower for big data sets.