Using reduceByKey in Apache Spark (Scala)


Question

I have a list of tuples of type (user id, name, count):

val x = sc.parallelize(List( ("a" , "b" , 1) , ("a" , "b" , 1) , ("c" , "b" , 1) , ("a" , "d" , 1)))

I'm attempting to reduce this collection to one where, for each user id, the counts are summed per name.

So x above would be converted to:

(a,ArrayBuffer((d,1), (b,2)))
(c,ArrayBuffer((b,1)))

Here is the code I am currently using:

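val byKey = x.map { case (id, uri, count) => (id, uri) -> count }

// Sum the counts for each (id, uri) pair, then regroup the pairs by id.
val grouped = byKey.groupByKey
val count = grouped.map { case ((id, uri), counts) => (id, (uri, counts.sum)) }
// Note: on Spark 1.0 and later, groupByKey yields Iterable rather than Seq.
val grouped2: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = count.groupByKey

grouped2.foreach(println)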

I'm attempting to use reduceByKey as it performs faster than groupByKey.

How can reduceByKey be used in place of the code above to produce the same mapping?

Recommended answer

Following your code:

val byKey = x.map({case (id,uri,count) => (id,uri)->count})

You could do:

val reducedByKey = byKey.reduceByKey(_ + _)

scala> reducedByKey.collect.foreach(println)
((a,d),1)
((a,b),2)
((c,b),1)
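The result above is keyed by (id, name), while the question asks for one entry per id. A minimal sketch of one way to get there using only reduceByKey follows. It assumes Spark is on the classpath and runs with a local master; the object name PerUserCounts, the function countsPerUser, and the app name are hypothetical, introduced only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3, pair-RDD operations also need:
// import org.apache.spark.SparkContext._

object PerUserCounts {
  // Hypothetical helper: (id, name, count) records -> (id, list of (name, total)).
  def countsPerUser(records: RDD[(String, String, Int)]): RDD[(String, List[(String, Int)])] =
    records
      .map { case (id, name, count) => ((id, name), count) }
      .reduceByKey(_ + _)                                   // sum counts per (id, name)
      .map { case ((id, name), total) => (id, List((name, total))) }
      .reduceByKey(_ ++ _)                                  // concatenate the per-name totals per id

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("reduceByKey-sketch")
    val sc = new SparkContext(conf)
    try {
      val x = sc.parallelize(List(("a", "b", 1), ("a", "b", 1), ("c", "b", 1), ("a", "d", 1)))
      // The order of keys and of the pairs inside each list is not guaranteed.
      countsPerUser(x).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The first reduceByKey shrinks the data because many counts collapse into one sum. The second only concatenates lists, so it moves roughly as much data as a groupByKey on the already-summed pairs would; the saving comes from summing first.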

PairRDDFunctions[K,V].reduceByKey takes an associative reduce function that can be applied to the value type V of the RDD[(K,V)]. In other words, you need a function f(e1: V, e2: V): V. In this particular case, summing Ints: (x: Int, y: Int) => x + y, or _ + _ in short underscore notation.
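To make the shape of that function concrete, here is a small plain-Scala sketch with no Spark involved. It shows the explicit and underscore forms side by side and spot-checks associativity on sample values. The names ReduceFunctionShape, sumExplicit, sumShort and isAssociativeOn are hypothetical.

```scala
object ReduceFunctionShape {
  // Both values have the type (V, V) => V with V = Int.
  val sumExplicit: (Int, Int) => Int = (x: Int, y: Int) => x + y
  val sumShort: (Int, Int) => Int = _ + _

  // Spot check on three sample values: does regrouping leave the result unchanged?
  def isAssociativeOn(f: (Int, Int) => Int, a: Int, b: Int, c: Int): Boolean =
    f(f(a, b), c) == f(a, f(b, c))

  def main(args: Array[String]): Unit = {
    val counts = List(1, 1, 1, 1)
    println(counts.reduce(sumExplicit) == counts.reduce(sumShort)) // both forms agree
    println(isAssociativeOn(sumShort, 1, 2, 3))                    // addition regroups safely
    println(isAssociativeOn(_ - _, 1, 2, 3))                       // subtraction does not
  }
}
```

Associativity matters because partial results may be merged in any grouping; a function like subtraction could give different answers depending on how the data happened to be partitioned.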

For the record: reduceByKey performs better than groupByKey because it attempts to apply the reduce function locally before the shuffle/reduce phase. groupByKey forces a shuffle of all elements before grouping.
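The effect of reducing locally first can be modelled with plain Scala collections. This is only an illustrative model, not how Spark is implemented: each inner Seq stands in for one partition, and the partition contents, the object name MapSideCombineSketch and the helper combineLocally are all hypothetical.

```scala
object MapSideCombineSketch {
  // Reduce the values of each key within a single partition (the "local" step).
  def combineLocally[K, V](partition: Seq[(K, V)])(f: (V, V) => V): Seq[(K, V)] =
    partition.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }.toSeq

  def main(args: Array[String]): Unit = {
    // Two made-up partitions of ((id, name), count) records.
    val partitions = Seq(
      Seq((("a", "b"), 1), (("a", "b"), 1), (("a", "b"), 1)),
      Seq((("a", "b"), 1), (("c", "b"), 1), (("c", "b"), 1))
    )

    // groupByKey-style: every record crosses the shuffle.
    val recordsWithoutCombine = partitions.map(_.size).sum

    // reduceByKey-style: at most one record per key leaves each partition.
    val combined = partitions.map(p => combineLocally(p)(_ + _))
    val recordsWithCombine = combined.map(_.size).sum

    // Merging the partial sums gives the same totals either way.
    val totals = combineLocally(combined.flatten)(_ + _).toMap

    println(s"records shuffled without local combine: $recordsWithoutCombine")
    println(s"records shuffled with local combine: $recordsWithCombine")
    println(totals)
  }
}
```

In this model six records would be shuffled without the local step and three with it, and the gap widens as keys repeat more often within a partition.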
