Error on running multiple Workflow in OOZIE-4.1.0


Question



I installed oozie 4.1.0 on a Linux machine by following the steps at http://gauravkohli.com/2014/08/26/apache-oozie-installation-on-hadoop-2-4-1/

hadoop version - 2.6.0 
maven - 3.0.4 
pig - 0.12.0

Cluster Setup -

MASTER NODE running - Namenode, Resourcemanager, Proxyserver.

SLAVE NODE running - Datanode, Nodemanager.

When I run a single workflow job, it succeeds. But when I try to run more than one workflow job, both jobs get stuck in the ACCEPTED state.

Inspecting the error log, I narrowed the problem down to:

2014-12-24 21:00:36,758 [JobControl] INFO  org.apache.hadoop.ipc.Client  - Retrying connect to server: 172.16.***.***/172.16.***.***:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-12-25 09:30:39,145 [communication thread] INFO  org.apache.hadoop.ipc.Client  - Retrying connect to server: 172.16.***.***/172.16.***.***:52406. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-12-25 09:30:39,199 [communication thread] INFO  org.apache.hadoop.mapred.Task  - Communication exception: java.io.IOException: Failed on local exception: java.net.SocketException: Network is unreachable: no further information; Host Details : local host is: "SystemName/127.0.0.1"; destination host is: "172.16.***.***":52406; 
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
 at org.apache.hadoop.ipc.Client.call(Client.java:1415)
 at org.apache.hadoop.ipc.Client.call(Client.java:1364)
 at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
 at $Proxy9.ping(Unknown Source)
 at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:742)
 at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.SocketException: Network is unreachable: no further information
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
 at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
 at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
 at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
 at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
 at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
 at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
 at org.apache.hadoop.ipc.Client.call(Client.java:1382)
 ... 5 more

Heart beat
Heart beat
.
.

Among the running jobs above, if I manually kill any one launcher job (hadoop job -kill <launcher-job-id>), all the jobs succeed. So I think the problem is that more than one launcher job running simultaneously causes the jobs to deadlock.

If anyone knows the reason and a solution for the above problem, please help me as soon as possible.

Solution

The problem is with the queue. When we run the jobs in the SAME QUEUE (default) with the above cluster setup, the ResourceManager is responsible for running the MapReduce jobs on the slave node. Due to the lack of resources on the slave node, the jobs running in the queue meet a deadlock situation.

To overcome this issue, we need to split the MapReduce jobs by triggering them in different queues.
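For a second queue to be available, the cluster's scheduler must define it. A minimal sketch of this (assuming YARN's CapacityScheduler; the queue name "launcher2" matches the answer, but the capacity percentages are illustrative assumptions) in capacity-scheduler.xml:

```xml
<!-- capacity-scheduler.xml sketch: define a second queue "launcher2"
     alongside "default". The capacity split below is an illustrative
     assumption, not from the original answer. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,launcher2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.launcher2.capacity</name>
    <value>30</value>
  </property>
</configuration>
```

After editing the file, the queues can be reloaded without restarting the ResourceManager via `yarn rmadmin -refreshQueues`.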

You can do this by setting this part in the pig action inside your oozie workflow.xml:

<configuration>
  <property>
    <name>mapreduce.job.queuename</name>
    <value>launcher2</value>
  </property>
</configuration>
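In context, the configuration block sits inside the pig action element. A sketch of the surrounding action (the action name, script name, and the ${jobTracker}/${nameNode} parameters are illustrative assumptions, not from the original post):

```xml
<!-- workflow.xml sketch: a pig action whose MapReduce jobs are submitted
     to the "launcher2" queue. Names and paths here are illustrative. -->
<action name="pig-node">
  <pig>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapreduce.job.queuename</name>
        <value>launcher2</value>
      </property>
    </configuration>
    <script>myscript.pig</script>
  </pig>
  <ok to="end"/>
  <error to="fail"/>
</action>
```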

NOTE: This solution is only for a SMALL CLUSTER SETUP.
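A related option worth noting (an addition, not part of the original answer): Oozie can route just the launcher job to a different queue than the action's own MapReduce job, by prefixing a Hadoop property with oozie.launcher. in the action configuration. This keeps launchers from occupying the slots their child jobs need. The queue names below are illustrative:

```xml
<!-- Sketch: send only the Oozie launcher job to a separate queue,
     while the action's real MapReduce job stays in "default".
     Queue names are illustrative assumptions. -->
<configuration>
  <property>
    <name>oozie.launcher.mapred.job.queue.name</name>
    <value>launcherqueue</value>
  </property>
  <property>
    <name>mapreduce.job.queuename</name>
    <value>default</value>
  </property>
</configuration>
```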

