org.apache.spark.rpc.RpcTimeoutException:期货在 [120 秒] 后超时.这个超时由 spark.rpc.lookupTimeout 控制 [英] org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout
问题描述
在向 YARN 提交 Spark 应用程序时出现以下关于容器的错误.HADOOP(2.7.3)/SPARK(2.1)环境在单节点集群中运行伪分布式模式.该应用程序在本地模型中运行时完美运行,但尝试使用 YARN 作为 RM 在集群模式下检查其正确性并遇到一些障碍.刚接触这个世界,因此寻求帮助.
Getting the below error with respect to the container while submitting an spark application to YARN. The HADOOP(2.7.3)/SPARK (2.1) environment is running a pseudo-distributed mode in a single node cluster. The application works perfectly when made to run in local model however trying to check its correctness in a cluster mode using YARN as RM and hit some roadblock. New to this world hence looking for help.
--- 应用程序日志
2017-04-11 07:13:28 INFO Client:58 - Submitting application 1 to ResourceManager
2017-04-11 07:13:28 INFO YarnClientImpl:174 - Submitted application application_1491909036583_0001 to ResourceManager at /0.0.0.0:8032
2017-04-11 07:13:29 INFO Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:13:29 INFO Client:58 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1491909208425
final status: UNDEFINED
tracking URL: http://ip-xxx.xx.xx.xxx:8088/proxy/application_1491909036583_0001/
user: xxxx
2017-04-11 07:13:30 INFO Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:13:31 INFO Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:13:32 INFO Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
2017-04-11 07:17:37 INFO Client:58 - Application report for application_1491909036583_0001 (state: FAILED)
2017-04-11 07:17:37 INFO Client:58 -
client token: N/A
diagnostics: Application application_1491909036583_0001 failed 2 times due to AM Container for appattempt_1491909036583_0001_000002 exited with exitCode: 10
For more detailed output, check application tracking page:http://"hostname":8088/cluster/app/application_1491909036583_0001Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1491909036583_0001_02_000001
Exit code: 10
Stack trace: ExitCodeException exitCode=10:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
at org.apache.hadoop.util.Shell.run(Shell.java:479)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
****--- 容器日志****
****--- Container Logs****
2017-04-11 07:13:30 INFO ApplicationMaster:47 - Registered signal handlers for [TERM, HUP, INT]
2017-04-11 07:13:31 INFO ApplicationMaster:59 - ApplicationAttemptId: appattempt_1491909036583_0001_000001
2017-04-11 07:13:32 INFO SecurityManager:59 - Changing view acls to: root,xxxx
2017-04-11 07:13:32 INFO SecurityManager:59 - Changing modify acls to: root,xxxx
2017-04-11 07:13:32 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, xxxx); users with modify permissions: Set(root, xxxx)
2017-04-11 07:13:32 INFO Slf4jLogger:80 - Slf4jLogger started
2017-04-11 07:13:32 INFO Remoting:74 - Starting remoting
2017-04-11 07:13:32 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://sparkYarnAM@xxx.xx.xx.xxx:45446]
2017-04-11 07:13:32 INFO Remoting:74 - Remoting now listens on addresses: [akka.tcp://sparkYarnAM@xxx.xx.xx.xxx:45446]
2017-04-11 07:13:32 INFO Utils:59 - Successfully started service 'sparkYarnAM' on port 45446.
2017-04-11 07:13:32 INFO ApplicationMaster:59 - Waiting for Spark driver to be reachable.
2017-04-11 07:13:32 INFO ApplicationMaster:59 - Driver now available: xxx.xx.xx.xxx:47503
2017-04-11 07:15:32 ERROR ApplicationMaster:96 - Uncaught exception:
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:98)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:116)
at org.apache.spark.deploy.yarn.ApplicationMaster.runAMEndpoint(ApplicationMaster.scala:279)
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:473)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:315)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:157)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:625)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:623)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:646)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
... 16 more
2017-04-11 07:15:32 INFO ApplicationMaster:59 - Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout)
2017-04-11 07:15:32 INFO ShutdownHookManager:59 - Shutdown hook called
--Yarn Node Manager 出现故障时的日志
2017-04-11 07:15:18,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:21,735 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:24,742 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:27,749 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:30,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
2017-04-11 07:15:33,018 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1491909036583_0001_01_000001 is : 10
2017-04-11 07:15:33,019 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1491909036583_0001_01_000001 and exit code: 10
ExitCodeException exitCode=10:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
-- SparkCONtext 参数
-- SparkCOntext parameters
<!-- Spark Configuration -->
<bean id="sparkInfo" class="SparkInfo">
<property name="appName" value="framework"></property>
<property name="master" value="yarn-client"></property>
<property name="dynamicAllocation" value="false"></property>
<property name="executorInstances" value="2"></property>
<property name="executorMemory" value="1g"></property>
<property name="executorCores" value="4"></property>
<property name="executorCoresMax" value="2"></property>
<property name="taskCpus" value="4"></property>
<property name="executorClassPath" value="/usr/hadoop/hadoop-2.7.3/share/hadoop/yarn/lib/*"></property>
<property name="yarnJar"
value="${framework.hdfsURI}/app/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar"></property>
<property name="yarnQueue" value="default"></property>
<property name="memoryFraction" value="0.4"></property>
</bean>
sparks.default.conf
spark.driver.memory 1g
spark.executor.extraJavaOptions -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
spark.rpc.lookupTimeout 600s
yarn-site.xml
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>3096</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3096</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
</property>
</configuration>
推荐答案
你可以不断增加 spark.network.timeout
直到你不再看到问题,正如 hemanshuIIITian 在评论中提到的.
spark工作量大时,会出现超时异常.如果您的执行程序内存不足,那么 GC 可能会使系统非常繁忙,从而增加工作量.如果出现内存不足错误,请查看日志.请在 spark.executor.extraJavaOptions
中启用 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
并查看日志是否在任务完成之前多次调用 full GC.如果是这种情况,请增加您的 executorMemory
.这应该有望解决您的问题.
You can keep increasing spark.network.timeout
until you stop seeing the problem , as mentioned by himanshuIIITian in comment.
When spark is under heavy workload, timeout exception can occur. If you have low executor memory then GC may keep system very busy which increases workload. Look into the logs if there is Out Of Memory error. Please enable -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
in spark.executor.extraJavaOptions
and look into logs if full GC is invoked a number of times before a task completes. If that is the case then increase your executorMemory
. That should hopefully solve your problem.
这篇关于org.apache.spark.rpc.RpcTimeoutException:期货在 [120 秒] 后超时.这个超时由 spark.rpc.lookupTimeout 控制的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!