Hadoop 作业失败,资源管理器无法识别 AttemptID [英] Hadoop job fails, Resource Manager doesnt recognize AttemptID

查看:21
本文介绍了Hadoop 作业失败,资源管理器无法识别 AttemptID的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在 Oozie 工作流程中聚合一些数据.但是聚合步骤失败了.

Im trying to aggregate some data in an Oozie workflow. However the aggregation step fails.

我在日志中发现了两个兴趣点:第一个是似乎反复出现的错误(?):

I found two points of interests in the logs: The first is an error(?) that seems to occur repeatedly:

容器完成后,它会被杀死,但会以非零退出代码 143 退出.

After a container finishes, it gets killed but exits with non-zero Exit code 143.

结束:

2015-05-04 15:35:12,013 INFO [IPC Server handler 7 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 0.7231312
2015-05-04 15:35:12,015 INFO [IPC Server handler 19 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 1.0

然后当它被 Application Master 杀死时:

And then then when it gets killed by Application Master:

2015-05-04 15:35:13,831 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1430730089455_0009_m_000048_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

第二个兴趣点是导致作业完全崩溃的实际错误,这发生在减少阶段,但不确定这两者是否相关:

The second point of interest is the actual error that crashes the job completely, this happens in the reduce-phase, not sure if these two are related though:

2015-05-04 15:35:28,767 INFO [IPC Server handler 20 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000051_0 is : 0.31450257
2015-05-04 15:35:29,930 INFO [IPC Server handler 10 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000052_0 is : 0.19511986
2015-05-04 15:35:31,549 INFO [IPC Server handler 1 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000050_0 is : 0.5324404
2015-05-04 15:35:31,771 INFO [IPC Server handler 28 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000051_0 is : 0.31450257
2015-05-04 15:35:31,890 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Resource Manager doesn't recognize AttemptId: application_1430730089455_0009
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Resource Manager doesn't recognize AttemptId: application_1430730089455_0009
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:675)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:244)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
    at java.lang.Thread.run(Thread.java:695)
Caused by: org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1430730089455_0009_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:394)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy36.allocate(Unknown Source)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:188)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:667)
    ... 3 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1430730089455_0009_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:394)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

    at org.apache.hadoop.ipc.Client.call(Client.java:1468)
    at org.apache.hadoop.ipc.Client.call(Client.java:1399)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy35.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 11 more

之后 oozie:launcher 作业和出现错误的作业将无限期地坐在那里,状态为:已接受、最终状态:未定义和跟踪 UI:未分配.

After that the oozie:launcher job and the job that got the error just sit there indefinitely with STATE:accepted, FINALSTATUS:undefined and TRACKING UI:unassigned.

有谁知道导致此错误的原因以及如何解决?以前的工作流程相同,我不能说我在两者之间进行了任何更改...

Does anyone know what is causing this error and how I can fix it? The same workflow worked before, and I couldnt say that I changed anything inbetween...

推荐答案

以防万一其他人发现此错误:这似乎是由于 hadoop 磁盘空间不足引起的......对于某些东西来说非常神秘的错误就那么简单.我认为 ~90GB 足以处理我的 30GB 数据集,但我错了.

Just in case somebody else stubles upon this error: It seemed like this was caused due to hadoop running out of disc space... Pretty cryptic error for something as simple as that. I thought ~90GB would be enough to work on my 30GB Dataset, I was wrong.

这篇关于Hadoop 作业失败,资源管理器无法识别 AttemptID的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆