Hadoop: Output file has double output
Question
I am running a Hadoop program and have the following as my input file, input.txt:
1
2
mapper.py:
import sys
for line in sys.stdin:
    print line,
print "Test"
reducer.py:
import sys
for line in sys.stdin:
    print line,
When I run it without Hadoop: $ cat ./input.txt | ./mapper.py | ./reducer.py, the output is as expected:
1
2
Test
However, running it through Hadoop via the streaming API (as described here), the latter part of the output seems somewhat "doubled":
1
2
Test
Test
Additionally, when I run the program through Hadoop, it seems like it has a 1/4 chance of failing due to this:
Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
I've looked at this for some time and can't figure out what I'm not getting. If anyone could help with these issues, I would greatly appreciate it! Thanks.
Edit: When input.txt is:
1
2
3
4
5
6
7
8
9
10
The output is:
1
10
2
3
4
5
6
7
8
9
Test
Test
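A side note on the ordering above: Hadoop's shuffle phase sorts keys as strings, not numbers, which is why 10 lands between 1 and 2. The same ordering can be reproduced in plain Python (Python 3 here, unlike the question's Python 2 scripts):

```python
# Keys are compared lexicographically, as Hadoop's default shuffle does,
# so "10" sorts before "2".
lines = [str(n) for n in range(1, 11)]
print(sorted(lines))  # ['1', '10', '2', '3', '4', '5', '6', '7', '8', '9']
```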
Accepted answer
It gives the same output. I suspect you are pointing the reducer at mapper.py as well, so the same script runs in both stages. Make sure you are providing the correct path to reducer.py.
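This diagnosis can be checked without a cluster: if the reducer path actually points at mapper.py, the input effectively flows through the mapper twice, which produces exactly one extra Test line. A minimal Python 3 sketch (the function names are illustrative, not part of the question's scripts):

```python
def mapper(lines):
    # Mirrors mapper.py: echo each input line, then emit "Test" once at the end.
    return list(lines) + ["Test"]

def reducer(lines):
    # Mirrors reducer.py: an identity pass-through.
    return list(lines)

data = ["1", "2"]

# Correct pipeline (mapper -> reducer): one "Test" line.
print(reducer(mapper(data)))   # ['1', '2', 'Test']

# Misconfigured pipeline (mapper used as the reducer): "Test" is doubled.
print(mapper(mapper(data)))    # ['1', '2', 'Test', 'Test']
```

The equivalent shell check is `cat ./input.txt | ./mapper.py | ./mapper.py`, which reproduces the doubled output seen in the Hadoop run.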