Apache Spark YARN mode startup takes too long (10+ secs)


Problem Description

I’m running a Spark application in YARN-client or YARN-cluster mode.

But it seems to take too long to start up.

It takes 10+ seconds to initialize the SparkContext.

Is this normal? Or can it be optimized?

The environment is as follows:

  • Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6) (Tiny test cluster with 3 data nodes)
  • Spark: 1.3.1
  • Client: Windows 7, but similar result on CentOS 6.6
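
For context, the initialization step being timed looks roughly like this minimal sketch (the class and application names are placeholders, not my actual code):

    import org.apache.spark.{SparkConf, SparkContext}

    object Main {
      def main(args: Array[String]): Unit = {
        val t0 = System.currentTimeMillis()

        val conf = new SparkConf()
          .setAppName("MyApp")      // placeholder name
          .setMaster("yarn-client") // or submit with: spark-submit --master yarn-cluster

        val sc = new SparkContext(conf) // this single line accounts for the 10+ seconds

        println(s"SparkContext ready after ${System.currentTimeMillis() - t0} ms")

        // ... application logic (MainProcessor etc.) runs from here ...

        sc.stop()
      }
    }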

The following is the startup part of the application log. (Some private information was edited)

‘Main: Initializing context’ on the first line and ‘MainProcessor: Deleting previous output files’ on the last line are log lines emitted by the application itself; everything in between comes from Spark. The application logic runs only after these startup logs have been printed.

15/05/07 09:18:31 INFO Main: Initializing context
15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
15/05/07 09:18:31 INFO SecurityManager: Changing view acls to: myuser,myapp
15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to: myuser,myapp
15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
15/05/07 09:18:31 INFO Remoting: Starting remoting
15/05/07 09:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@mymachine:54449]
15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 54449.
15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity 1956.7 MB
15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/07 09:18:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54450
15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file server' on port 54450.
15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/07 09:18:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/07 09:18:32 INFO SparkUI: Started SparkUI at http://mymachine:4040
15/05/07 09:18:32 INFO SparkContext: Added JAR file:/D:/Projects/MyApp/MyApp.jar at http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at cluster01/10.111.111.11:8050
15/05/07 09:18:32 INFO Client: Requesting a new application from cluster with 3 NodeManagers
15/05/07 09:18:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23040 MB per container)
15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/05/07 09:18:32 INFO Client: Setting up container launch context for our AM
15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
15/05/07 09:18:32 INFO Client: Source and destination file systems are the same. Not copying hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
15/05/07 09:18:32 INFO Client: Setting up the launch environment for our AM container
15/05/07 09:18:33 INFO SecurityManager: Changing view acls to: myuser,myapp
15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to: myuser,myapp
15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
15/05/07 09:18:33 INFO Client: Submitting application 2 to ResourceManager
15/05/07 09:18:33 INFO YarnClientImpl: Submitted application application_1430956687773_0002
15/05/07 09:18:34 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:34 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1430957906540
     final status: UNDEFINED
     tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
     user: myapp
15/05/07 09:18:35 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:36 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:37 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@cluster02:39698/user/YarnAM#-1579648782]
15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> cluster01, PROXY_URI_BASES -> http://cluster01:8088/proxy/application_1430956687773_0002), /proxy/application_1430956687773_0002
15/05/07 09:18:37 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/05/07 09:18:38 INFO Client: Application report for application_1430956687773_0002 (state: RUNNING)
15/05/07 09:18:38 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: cluster02
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1430957906540
     final status: UNDEFINED
     tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
     user: myapp
15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application application_1430956687773_0002 has started running.
15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on 54491
15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register BlockManager
15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block manager mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine, 54491)
15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979] with ID 1
15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files

Thanks.

UPDATE

I think I’ve found the (maybe partial, but major) reason.

The delay lies between the following two lines:

15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1

When I read the logs on the cluster side, I found the following lines (the exact timestamps differ from the lines above, but only because of clock differences between the machines):

15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454

It seems that Spark deliberately sleeps for 5 seconds.

I’ve read the Spark source code, and in org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread() has the code for that. It loops, calling allocator.allocateResources() and then Thread.sleep(). For the sleep, it reads the configuration variable spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000, i.e. 5 seconds). According to the comment, "we want to be reasonably responsive without causing too many requests to RM". So, unless YARN fulfills the allocation request immediately, it seems that up to 5 seconds are wasted.
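
To make that behaviour concrete, here is a simplified, self-contained paraphrase of the reporter loop; it is not the verbatim Spark 1.3.1 source, and allocateResources() is stubbed out (in the real ApplicationMaster it sends an allocate request to the ResourceManager via the AMRMClient):

    object ReporterLoopSketch {
      @volatile var finished = false

      // Stand-in for allocator.allocateResources(): in Spark this heartbeats the RM
      // and registers whatever containers came back.
      def allocateResources(): Unit = println("allocate request sent to RM")

      def main(args: Array[String]): Unit = {
        // Default is 5000 ms; the AM sleeps this long after every allocate call,
        // even when the previous request has not been satisfied yet.
        val intervalMs =
          sys.props.getOrElse("spark.yarn.scheduler.heartbeat.interval-ms", "5000").toLong

        val reporter = new Thread {
          override def run(): Unit = {
            while (!finished) {
              allocateResources()
              Thread.sleep(intervalMs)
            }
          }
        }
        reporter.setDaemon(true)
        reporter.start()

        Thread.sleep(12000) // let the sketch run a few iterations, then stop
        finished = true
      }
    }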

When I modified the configuration variable to 1000, it only waited for 1 sec.

Here are the log lines after the change:

15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000
15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454

4 secs saved.

So, when one does not want to wait 5 seconds, one can lower spark.yarn.scheduler.heartbeat.interval-ms.
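
For reference, a sketch of how the property can be set, either on the SparkConf in the driver or on the spark-submit command line (the app name is a placeholder):

    import org.apache.spark.SparkConf

    // In the driver, before the SparkContext is created:
    val conf = new SparkConf()
      .setAppName("MyApp") // placeholder
      .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000") // default is 5000 (5 secs)

    // Or equivalently on the command line:
    //   spark-submit --conf spark.yarn.scheduler.heartbeat.interval-ms=1000 ...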

I hope that the additional overhead it incurs (more frequent heartbeats to the ResourceManager) is negligible.

UPDATE

A related JIRA issue has since been opened and resolved. See https://issues.apache.org/jira/browse/SPARK-7533

Solution

This is pretty typical. My system takes about 20 seconds from running spark-submit until getting a SparkContext.

As it says in the docs in a couple of places, the solution is to turn your driver into an RPC server. That way you initialize once, and then other applications can use the driver's context as a service.

I am in the middle of doing this with my application. I am using http4s and turning my driver into a web server.
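
To illustrate the idea, here is a minimal sketch of a driver that initializes once and then serves requests over HTTP. It uses the JDK's built-in com.sun.net.httpserver instead of http4s so that it has no extra dependencies, and the port, endpoint, and HDFS path are made-up examples:

    import java.net.InetSocketAddress
    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import org.apache.spark.{SparkConf, SparkContext}

    object DriverServer {
      def main(args: Array[String]): Unit = {
        // Pay the 10-20 second startup cost once, when the service starts.
        val sc = new SparkContext(
          new SparkConf().setAppName("DriverServer").setMaster("yarn-client"))

        val server = HttpServer.create(new InetSocketAddress(8090), 0) // example port
        server.createContext("/count", new HttpHandler {
          def handle(exchange: HttpExchange): Unit = {
            // Every request reuses the already-initialized SparkContext.
            val n = sc.textFile("hdfs:///data/input").count() // example path
            val body = n.toString.getBytes("UTF-8")
            exchange.sendResponseHeaders(200, body.length)
            exchange.getResponseBody.write(body)
            exchange.close()
          }
        })
        server.start()
      }
    }

The same structure applies with http4s; only the HTTP layer changes. The point is the long-lived SparkContext, which saves the startup cost on every request after the first.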
