How to get the Reducer to emit only duplicates

Question

I have a Mapper that goes through a large amount of data and emits ID numbers as keys, each with the value 1. What I want the MapReduce job to produce is a list of all IDs that were found more than once across all the data, that is, a list of duplicate IDs. For example:

Mapper emits:
abc 1
efg 1
cba 1
abc 1
dhh 1

In this case, you can see that the ID 'abc' has been emitted more than one time by the Mapper.

How do I edit this Reducer so that it emits only the duplicates, i.e. keys whose total count is greater than 1?

import sys
import codecs

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
inData = codecs.getreader('utf-8')(sys.stdin)

(last_key, tot_cnt) = (None, 0)
for line in inData:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        sys.stdout.write("%s\t%s\n" % (last_key,tot_cnt))
        (last_key, tot_cnt) = (key, int(val))
    else:
        (last_key, tot_cnt) = (key, tot_cnt + int(val))

if last_key:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))

Recommended answer

You have made a mistake in a couple of places.

  1. This code:

if last_key and last_key != key:
    sys.stdout.write("%s\t%s\n" % (last_key,tot_cnt))

should be changed to:

if last_key != key:
    if(tot_cnt > 1):
        sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))

You were not checking that tot_cnt > 1 before writing the key.

  2. The last two lines:

if last_key:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))

should be changed to:

if last_key and tot_cnt > 1:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))

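Again, you were not checking that tot_cnt > 1. This final check is needed because the last key is only written after the loop ends.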

The following is the modified code, which works for me:

import sys
import codecs

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
inData = codecs.getreader('utf-8')(sys.stdin)

(last_key, tot_cnt) = (None, 0)
for line in inData:
    (key, val) = line.strip().split("\t")
    if last_key != key:
        if(tot_cnt > 1):
            sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))
        (last_key, tot_cnt) = (key, int(val))
    else:
        (last_key, tot_cnt) = (key, tot_cnt + int(val))

if last_key and tot_cnt > 1:
    sys.stdout.write("%s\t%s\n" % (last_key, tot_cnt))

For your data, I get the following output:

abc     2
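
The same filtering can also be written with itertools.groupby. The following is only a minimal sketch, assuming Python 3 and assuming the reducer input is already sorted by key (as Hadoop Streaming normally delivers it) with exactly one tab per record. The function name emit_duplicates is hypothetical and not part of the original answer. The codecs wrappers used in the code above appear to target Python 2; in Python 3 the standard streams are already text streams, so the sketch omits them.

```python
from itertools import groupby


def emit_duplicates(lines):
    """Yield (key, total) for every key whose summed count is greater than 1.

    Assumption: `lines` is sorted by key and each record is "key<TAB>value".
    """
    # Split each non-blank record into [key, value].
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby collects consecutive records that share the same key.
    for key, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(val) for _, val in group)
        if total > 1:  # keep only keys seen more than once
            yield key, total


# The sample data from the question, sorted by key as a reducer would receive it.
sample = ["abc\t1\n", "abc\t1\n", "cba\t1\n", "dhh\t1\n", "efg\t1\n"]
for key, total in emit_duplicates(sample):
    print("%s\t%d" % (key, total))  # prints: abc<TAB>2
```

In a streaming job, sys.stdin would be passed to emit_duplicates in place of the sample list. Grouping by key removes the need for a separate check after the loop, because the last group is handled the same way as every other group.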
