Apache Spark - reduceByKey - Java


Question

I am trying to understand how reduceByKey works in Spark, using Java as the programming language.

Say I have the sentence "I am who I am". I break the sentence into words and store it as a list: [I, am, who, I, am].

Now this function assigns 1 to each word:

JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
    }
});

So the output is like this:

(I,1) 
(am,1)
(who,1)
(I,1)
(am,1)

Now if I have 3 reducers running, each reducer will get a key and the values associated with that key:

reducer 1:
    (I,1)
    (I,1)

reducer 2:
    (am,1)
    (am,1)

reducer 3:
    (who,1)

I would like to know:

a. What exactly happens in the function below?
b. What are the parameters of new Function2<Integer, Integer, Integer>?
c. Basically, how is the JavaPairRDD formed?

JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
    }
});


Answer

I think your question revolves around the reduce function here, which is a function of 2 arguments returning 1, whereas in a Reducer you implement a many-to-many function.

This API is simpler, if less general. Here you provide an associative operation that can reduce any 2 values down to 1 (e.g. two integers sum to one). This is used to reduce all the values for each key down to 1. It's not necessary to provide an N-to-1 function, since that can be accomplished with a 2-to-1 function. Here, you can't emit multiple values for one key.
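
For example (a sketch, assuming Java 8 lambdas and the same ones RDD from the question), the same 2-to-1 function can be written as a lambda, and Spark applies it pairwise for each key:

// The lambda plays the role of Function2<Integer, Integer, Integer>: given two
// counts for the same key, return one combined count. For "am" the values 1 and 1
// are reduced to 2; for "who" there is nothing to combine, so its 1 passes through.
JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

// Collecting on the driver (fine for tiny data) prints something like:
// (am,2) (who,1) (I,2)
for (Tuple2<String, Integer> t : counts.collect()) {
    System.out.println(t);
}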

The result is a (key, reduced value) pair for each (key, bunch of values).

The Mapper and Reducer in classic Hadoop MapReduce were actually both quite similar (just that one takes a collection of values rather than a single value per key) and let you implement a lot of patterns. In a way that's good; in a way that was wasteful and complex.

You can still reproduce what Mappers and Reducers do, but the method in Spark is mapPartitions, possibly paired with groupByKey. Those are the most general operations you might consider, and I'm not saying you should emulate MapReduce this way in Spark. In fact, it's unlikely to be efficient. But it is possible.
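
As a rough sketch of that idea (using groupByKey with mapValues here rather than mapPartitions, and again assuming the ones RDD from the question): a classic Reducer sees each key together with all of its values, which you can approximate like this:

// Needs org.apache.spark.api.java.function.Function. Each key arrives with an
// Iterable of all its values, like a Hadoop Reducer's input; here we just sum
// them, but we could emit any single derived value per key.
JavaPairRDD<String, Integer> hadoopStyleCounts =
        ones.groupByKey().mapValues(new Function<Iterable<Integer>, Integer>() {
            @Override
            public Integer call(Iterable<Integer> values) {
                int sum = 0;
                for (Integer v : values) {
                    sum += v;
                }
                return sum;
            }
        });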
