Hadoop job fails, Resource Manager doesn't recognize AttemptID


Problem description

I'm trying to aggregate some data in an Oozie workflow. However, the aggregation step fails.

I found two points of interest in the logs. The first is an error(?) that seems to occur repeatedly:

After a container finishes, it gets killed and exits with the non-zero exit code 143.

It finishes:

2015-05-04 15:35:12,013 INFO [IPC Server handler 7 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 0.7231312
2015-05-04 15:35:12,015 INFO [IPC Server handler 19 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000048_0 is : 1.0

And then it gets killed by the Application Master:

2015-05-04 15:35:13,831 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1430730089455_0009_m_000048_0: Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
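
For context, exit code 143 follows the common Unix convention of 128 plus a signal number, so it corresponds to signal 15 (SIGTERM). That matches the "Container killed on request" message and suggests the container was asked to stop rather than crashing on its own. A minimal sketch that decodes such exit codes (the helper name `describe_exit_code` is hypothetical, not part of Hadoop):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a shell-style exit code.

    Assumes the usual convention that codes above 128 mean
    "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"terminated by unknown signal {signum}"
    return "success" if code == 0 else f"exited with error code {code}"


print(describe_exit_code(143))
```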

The second point of interest is the actual error that crashes the job completely. It happens in the reduce phase, and I'm not sure whether the two are related:

2015-05-04 15:35:28,767 INFO [IPC Server handler 20 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000051_0 is : 0.31450257
2015-05-04 15:35:29,930 INFO [IPC Server handler 10 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000052_0 is : 0.19511986
2015-05-04 15:35:31,549 INFO [IPC Server handler 1 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000050_0 is : 0.5324404
2015-05-04 15:35:31,771 INFO [IPC Server handler 28 on 49697] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1430730089455_0009_m_000051_0 is : 0.31450257
2015-05-04 15:35:31,890 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error communicating with RM: Resource Manager doesn't recognize AttemptId: application_1430730089455_0009
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Resource Manager doesn't recognize AttemptId: application_1430730089455_0009
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:675)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:244)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
    at java.lang.Thread.run(Thread.java:695)
Caused by: org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt appattempt_1430730089455_0009_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:394)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy36.allocate(Unknown Source)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:188)
    at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:667)
    ... 3 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1430730089455_0009_000001 doesn't exist in ApplicationMasterService cache.
    at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
    at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:394)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

    at org.apache.hadoop.ipc.Client.call(Client.java:1468)
    at org.apache.hadoop.ipc.Client.call(Client.java:1399)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy35.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 11 more

After that, the oozie:launcher job and the job that got the error just sit there indefinitely with STATE:accepted, FINALSTATUS:undefined and TRACKING UI:unassigned.

Does anyone know what is causing this error and how I can fix it? The same workflow worked before, and I can't say that I changed anything in between.

Solution

Just in case somebody else stumbles upon this error: it seems this was caused by Hadoop running out of disk space. A pretty cryptic error for something as simple as that. I thought ~90GB would be enough to work on my 30GB dataset; I was wrong.
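
Since the root cause here was disk space, a quick check of how full the local disks are can save some log reading. The sketch below is only an illustration: the function names are hypothetical, the path `"."` is a placeholder for whatever directories your NodeManager and DataNode actually use, and the 90% default is an assumption chosen to mirror the usual limit of YARN's disk health checker (`yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage`); check your own cluster configuration for the real value.

```python
import shutil


def utilization_percent(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def is_over_threshold(path: str, threshold_percent: float = 90.0) -> bool:
    """Report whether the filesystem holding `path` is fuller than the threshold.

    The 90% default is an assumption; use the value from your own cluster config.
    """
    usage = shutil.disk_usage(path)
    return utilization_percent(usage.total, usage.used) > threshold_percent


# "." is a placeholder; point this at the real local data directories.
usage = shutil.disk_usage(".")
print(f"{utilization_percent(usage.total, usage.used):.1f}% used")
print("over threshold:", is_over_threshold("."))
```

Keep in mind that a MapReduce job writes intermediate map output and shuffle data to local disk in addition to its input and final output, so a 30GB dataset can need considerably more than 30GB of free space while the job runs.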
