Hadoop Streaming Job Failed (Not Successful) in Python

Problem Description

I'm trying to run a Map-Reduce job on Hadoop Streaming with Python scripts and getting the same errors as Hadoop Streaming Job failed error in python but those solutions didn't work for me.

My scripts work fine when I run "cat sample.txt | ./p1mapper.py | sort | ./p1reducer.py"

But when I run the following:

./bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input "p1input/*" \
    -output p1output \
    -mapper "python p1mapper.py" \
    -reducer "python p1reducer.py" \
    -file /Users/Tish/Desktop/HW1/p1mapper.py \
    -file /Users/Tish/Desktop/HW1/p1reducer.py

(NB: Even if I remove the "python" or type the full pathname for -mapper and -reducer, the result is the same)

This is the output I get:

packageJobJar: [/Users/Tish/Desktop/HW1/p1mapper.py, /Users/Tish/Desktop/CS246/HW1/p1reducer.py, /Users/Tish/Documents/workspace/hadoop-0.20.2/tmp/hadoop-unjar4363616744311424878/] [] /var/folders/Mk/MkDxFxURFZmLg+gkCGdO9U+++TM/-Tmp-/streamjob3714058030803466665.jar tmpDir=null
11/01/18 03:02:52 INFO mapred.FileInputFormat: Total input paths to process : 1
11/01/18 03:02:52 INFO streaming.StreamJob: getLocalDirs(): [tmp/mapred/local]
11/01/18 03:02:52 INFO streaming.StreamJob: Running job: job_201101180237_0005
11/01/18 03:02:52 INFO streaming.StreamJob: To kill this job, run:
11/01/18 03:02:52 INFO streaming.StreamJob: /Users/Tish/Documents/workspace/hadoop-0.20.2/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201101180237_0005
11/01/18 03:02:52 INFO streaming.StreamJob: Tracking URL: http://www.glassdoor.com:50030/jobdetails.jsp?jobid=job_201101180237_0005
11/01/18 03:02:53 INFO streaming.StreamJob:  map 0%  reduce 0%
11/01/18 03:03:05 INFO streaming.StreamJob:  map 100%  reduce 0%
11/01/18 03:03:44 INFO streaming.StreamJob:  map 50%  reduce 0%
11/01/18 03:03:47 INFO streaming.StreamJob:  map 100%  reduce 100%
11/01/18 03:03:47 INFO streaming.StreamJob: To kill this job, run:
11/01/18 03:03:47 INFO streaming.StreamJob: /Users/Tish/Documents/workspace/hadoop-0.20.2/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201101180237_0005
11/01/18 03:03:47 INFO streaming.StreamJob: Tracking URL: http://www.glassdoor.com:50030/jobdetails.jsp?jobid=job_201101180237_0005
11/01/18 03:03:47 ERROR streaming.StreamJob: Job not Successful!
11/01/18 03:03:47 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

For each Failed/Killed Task Attempt:

Map output lost, rescheduling: getMapOutput(attempt_201101181225_0001_m_000000_0,0) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201101181225_0001/attempt_201101181225_0001_m_000000_0/output/file.out.index in any of the configured local directories
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2887)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:324)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
    at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)

Here are my Python scripts:

p1mapper.py

#!/usr/bin/env python

import sys
import re

SEQ_LEN = 4

eos = re.compile('(?<=[a-zA-Z])\.')   # period preceded by a letter
ignore = re.compile('[\W\d]')

for line in sys.stdin:
    # split the line into sentences on periods that follow a letter
    array = re.split(eos, line)
    for sent in array:
        sent = ignore.sub('', sent)   # drop whitespace, punctuation and digits
        sent = sent.lower()
        if len(sent) >= SEQ_LEN:
            # emit every 4-character window of the sentence with a count of 1
            for i in range(len(sent)-SEQ_LEN + 1):
                print '%s 1' % sent[i:i+SEQ_LEN]

p1reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

word2count = {}

for line in sys.stdin:
    word, count = line.split(' ', 1)
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:    # count was not a number
        pass

# sort
sorted_word2count = sorted(word2count.items(), key=itemgetter(1), reverse=True)

# write the top 3 sequences
for word, count in sorted_word2count[0:3]:
    print '%s\t%s' % (word, count)

Would really appreciate any help, thanks!

UPDATE:

hdfs-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

          <name>dfs.replication</name>

          <value>1</value>

</property>

</configuration>

mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

          <name>mapred.job.tracker</name>

          <value>localhost:54311</value>

</property>

</configuration>

Solution

You are missing a lot of configuration: you need to define the working and data directories and so on (a minimal sketch of the relevant config files follows the list below). See here:

http://wiki.apache.org/hadoop/QuickStart

Distributed operation is just like the pseudo-distributed operation described on that page, except:

  1. Specify hostname or IP address of the master server in the values for fs.default.name and mapred.job.tracker in conf/hadoop-site.xml. These are specified as host:port pairs.
  2. Specify directories for dfs.name.dir and dfs.data.dir in conf/hadoop-site.xml. These are used to hold distributed filesystem data on the master node and slave nodes respectively. Note that dfs.data.dir may contain a space- or comma-separated list of directory names, so that data may be stored on multiple devices.
  3. Specify mapred.local.dir in conf/hadoop-site.xml. This determines where temporary MapReduce data is written. It also may be a list of directories.
  4. Specify mapred.map.tasks and mapred.reduce.tasks in conf/mapred-default.xml. As a rule of thumb, use 10x the number of slave processors for mapred.map.tasks, and 2x the number of slave processors for mapred.reduce.tasks.
  5. List all slave hostnames or IP addresses in your conf/slaves file, one per line and make sure jobtracker is in your /etc/hosts file pointing to your jobtracker node
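To make the directory requirements concrete, here is a minimal sketch for a single-node 0.20.x setup, in the same style as the files you posted (in 0.20 the old conf/hadoop-site.xml is split across core-site.xml, hdfs-site.xml and mapred-site.xml). The paths under /Users/Tish/hadoop-data and the hdfs://localhost:54310 address are placeholders of my own, not values from your setup: point them at directories that actually exist and are writable, then reformat the namenode (bin/hadoop namenode -format, which wipes HDFS) and restart the daemons.

core-site.xml:

<?xml version="1.0"?>
<configuration>

  <property>
    <!-- default filesystem; a host:port pair, placeholder value -->
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>

  <property>
    <!-- base directory for Hadoop's temporary/local data; placeholder path -->
    <name>hadoop.tmp.dir</name>
    <value>/Users/Tish/hadoop-data/tmp</value>
  </property>

</configuration>

hdfs-site.xml (added next to your existing dfs.replication property):

  <property>
    <!-- where the namenode keeps filesystem metadata -->
    <name>dfs.name.dir</name>
    <value>/Users/Tish/hadoop-data/dfs/name</value>
  </property>

  <property>
    <!-- where datanodes store blocks; may be a comma-separated list -->
    <name>dfs.data.dir</name>
    <value>/Users/Tish/hadoop-data/dfs/data</value>
  </property>

mapred-site.xml (added next to your existing mapred.job.tracker property):

  <property>
    <!-- where intermediate map output is written; the DiskErrorException in
         your log is the TaskTracker failing to find a file under this directory -->
    <name>mapred.local.dir</name>
    <value>/Users/Tish/hadoop-data/mapred/local</value>
  </property>

If I remember correctly, mapred.local.dir defaults to ${hadoop.tmp.dir}/mapred/local, so even just giving hadoop.tmp.dir a stable absolute path (instead of the relative tmp/ directory that your getLocalDirs() log line shows) may be enough to make the shuffle find the map output.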
