MRjob: Can a reducer perform 2 operations?


Question

I am trying to yield the probability of each key, value pair generated by the mapper.

So, let's say the mapper yields:

a, (r, 5)
a, (e, 6)
a, (w, 7)

I need to add 5 + 6 + 7 = 18 and then find the probabilities 5/18, 6/18, 7/18.

So the final output from the reducer would look like:

a, [[r, 5, 0.278], [e, 6, 0.333], [w, 7, 0.389]]

So far, I can only get the reducer to sum all the integers from the values. How can I make it go back and divide each instance by the total sum?

Thanks!

Recommended answer

Pai's solution is technically correct, but in practice it will give you a lot of strife, because setting the partitioning can be a big pain (see https://groups.google.com/forum/#!topic/mrjob/aV7bNn0sJ2k).

You can achieve this task more easily by using mrjob.step and creating two reducers, as in this example: https://github.com/Yelp/mrjob/blob/master/mrjob/examples/mr_next_word_stats.py
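As a rough illustration of the two-reducer idea, here is a plain-Python sketch that does not import mrjob at all. The names count_reducer, frequency_reducer and run_step are hypothetical, and run_step is only a stand-in for the grouping that the framework does between steps. In a real job the two reducers would be methods wired together with MRStep objects returned from steps().

```python
from collections import defaultdict


def count_reducer(word, counts):
    """Step 1: total each word, then re-key the result under None."""
    yield None, (word, sum(counts))


def frequency_reducer(_, word_counts):
    """Step 2: receives every (word, count) pair, so it can total, then divide."""
    pairs = list(word_counts)  # buffer, because the iterator can be read only once
    total = sum(count for _word, count in pairs)
    for word, count in pairs:
        yield word, (count, count / total)


def run_step(reducer, pairs):
    """Hypothetical stand-in for the shuffle: group values by key, then reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    for key, values in groups.items():
        yield from reducer(key, iter(values))


# Made-up mapper output, one (word, 1) pair per occurrence
mapped = [("loan", 1), ("fee", 1), ("loan", 1), ("loan", 1)]
step1 = list(run_step(count_reducer, mapped))
step2 = dict(run_step(frequency_reducer, step1))
print(step2)
```

Because everything is re-keyed under None, the second reducer handles all words in a single call, so the buffered list has to fit in memory.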

To do it in the vein you're describing:

from collections import defaultdict
import re

from mrjob.job import MRJob

wordRe = re.compile(r"[\w]+")


class MRComplaintFrequencyCount(MRJob):

    def mapper(self, _, line):
        self.increment_counter('group', 'num_mapper_calls', 1)

        # The issue text is read from index 3 of the comma-split line
        issue = line.split(",")[3]

        for word in wordRe.findall(issue):
            # Emit (word, 1); the combiner re-keys these under None
            yield word.lower(), 1

    def reducer(self, key, values):
        self.increment_counter('group', 'num_reducer_calls', 1)
        wordCounts = defaultdict(int)
        total = 0
        # Each value is a (word, count) pair produced by the combiner,
        # so this reducer depends on the combiner having run
        for value in values:
            word, count = value
            total += count
            wordCounts[word] += count

        for k, v in wordCounts.items():
            # word, frequency, relative frequency
            yield k, (v, float(v) / total)

    def combiner(self, key, values):
        self.increment_counter('group', 'num_combiner_calls', 1)
        # Send every partial count to the same reducer call
        yield None, (key, sum(values))


if __name__ == '__main__':
    MRComplaintFrequencyCount.run()

This does a standard word count and aggregates mostly in the combiner, which then emits every (word, count) pair under the common key None, so every word reaches the same reducer call. In the reducer you can compute the total word count and then the relative frequencies.
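Applied to the data in the question, the same "total first, then divide" idea can be sketched as a standalone generator. This is only an illustration under assumptions: the function below is not an MRJob method, and it assumes the values arrive as (label, count) pairs that fit in memory for one key. The body could be moved into a reducer method largely unchanged.

```python
def reducer(key, values):
    # values can be iterated only once, so keep the pairs in a list first
    pairs = list(values)
    total = sum(count for _label, count in pairs)
    # Second pass over the buffered pairs: divide each count by the total
    yield key, [[label, count, round(count / total, 3)] for label, count in pairs]


# The mapper output from the question: a, (r, 5), a, (e, 6), a, (w, 7)
result = list(reducer("a", iter([("r", 5), ("e", 6), ("w", 7)])))
print(result)
```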
