Hadoop Streaming Job failed error in python

Problem description

From this guide, I have successfully run the sample exercise. But when I run my own mapreduce job, I get the following error:
ERROR streaming.StreamJob: Job not Successful!
10/12/16 17:13:38 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

Error from the log file

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

mapper.py

import sys

# i is the 1-based number of the input line currently being processed
i = 0

for line in sys.stdin:
    i += 1
    count = {}
    # count the occurrences of each word on this line
    for word in line.strip().split():
        count[word] = count.get(word, 0) + 1
    # emit one record per distinct word: word<TAB>line_number:count
    for word, weight in count.items():
        print '%s\t%s:%s' % (word, str(i), str(weight))
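
For illustration (a hypothetical run, not part of the original input): if the first input line is "hello world hello", the mapper emits one record per distinct word, in whatever order the dictionary yields them:

hello	1:2
world	1:1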

reducer.py

import sys

keymap = {}        # note: never used below
o_tweet = "2323"   # previous key; initialised to a value no real key should match
id_list = []       # (line_number, count) pairs seen so far for the current key

for line in sys.stdin:
    # each record from the mapper looks like: word<TAB>line_number:count
    tweet, tw = line.strip().split()
    #print tweet,o_tweet,tweet_id,id_list
    tweet_id, w = tw.split(':')
    w = int(w)
    if tweet == o_tweet:
        # same key as the previous record: pair it with every earlier id
        for i, wt in id_list:
            print '%s:%s\t%s' % (tweet_id, i, str(w + wt))
        id_list.append((tweet_id, w))
    else:
        # a new key starts a fresh list
        id_list = [(tweet_id, w)]
        o_tweet = tweet
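
Hadoop sorts the mapper output by key before the reduce phase, so all records for one word arrive consecutively; the reducer pairs each new line id with every id already collected for that word and prints the summed counts. A hypothetical illustration: the two sorted records

hello	1:2
hello	2:1

make the reducer print

2:1	3

(line 2 paired with line 1, weight 1 + 2).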

[edit] command to run the job:

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input my-input/* -output my-output

Input is any random sequence of sentences.

Thanks,

Solution

Your -mapper and -reducer should just be the script name.

hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input my-input/* -output my-output

When your scripts are shipped with the job, they sit in a folder on HDFS belonging to the task attempt, and the attempt executes with "." as its working directory, so the bare script name resolves correctly. (FYI, if you ever want to add another -file, such as a lookup table, you can open it in Python as if it were in the same directory as your scripts while your script runs in the M/R job.)
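
A minimal sketch of that lookup-table trick (the name lookup.txt and its tab-separated format are assumptions for illustration; the file would be shipped with an extra -file /home/hadoop/lookup.txt on the command line):

# inside mapper.py: 'lookup.txt' resolves relative to the task's working directory
lookup = {}
for ln in open('lookup.txt'):
    key, val = ln.strip().split('\t')
    lookup[key] = val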

Also make sure you have run chmod a+x mapper.py and chmod a+x reducer.py.
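
Two more things worth checking, though these are assumptions rather than anything visible in the posted log: the executable bit only helps if each script starts with an interpreter line, i.e. the very first line of mapper.py and reducer.py should be

#!/usr/bin/env python

and you can rule out plain Python errors by simulating the streaming pipeline locally, outside Hadoop (sample.txt stands in for any input file):

cat sample.txt | python mapper.py | sort | python reducer.py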
