Submitting jobs to Spark EC2 cluster remotely

Problem description

I've set up the EC2 cluster with Spark. Everything works, all master/slaves are up and running.

I'm trying to submit a sample job (SparkPi). When I ssh to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for --deploy-mode:

--deploy-mode=client:

From my laptop:

./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

This results in the following warnings/errors, repeating indefinitely:


WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1

...and failed drivers appear in the Spark Web UI under "Completed Drivers" with "State=ERROR".

I've tried passing limits for cores and memory to the submit script, but it didn't help...
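
For example, an invocation with explicit resource limits (the values below are only placeholders, not the exact numbers I used) looks like:

./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --total-executor-cores 2 --executor-memory 512m --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar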

--deploy-mode=cluster:

From my laptop:

./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

Results in:


.... Driver successfully submitted as driver-20150223023734-0007
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150223023734-0007 is ERROR
Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
    at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
    at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)

So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks.

UPDATE: So for the second issue, in cluster mode the file must be globally visible to each cluster node, so it has to be somewhere in an accessible location. This solves the IOException but leads to the same issue as in client mode.
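
For example (the paths and the HDFS step below are illustrative, not from the original post), the jar can be placed somewhere every worker can read, such as the cluster's HDFS, and referenced by that URI:

# illustrative only: copy the application jar into HDFS on the cluster
hadoop fs -put ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar /jars/
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi hdfs:///jars/ec2test_2.10-0.0.1.jar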

Answer

The documentation at:

http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security

lists all the different communication channels used in a Spark cluster. As you can see, there are a bunch where the connection is made from the Executor(s) to the Driver. When you run with --deploy-mode=client, the driver runs on your laptop, so the executors will try to make a connection to your laptop. If the AWS security group that your executors run under blocks outbound traffic to your laptop (which the default security group created by the Spark EC2 scripts doesn't), or you are behind a router/firewall (more likely), they fail to connect and you get the errors you are seeing.

So to resolve it, you have to forward all the necessary ports to your laptop, or reconfigure your firewall to allow connections to those ports. Seeing as a bunch of the ports are chosen at random, this means opening up a wide range of, if not all, ports. So probably using --deploy-mode=cluster, or running the client from the cluster, is less painful.
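
If you do want to keep the driver on your laptop, a minimal sketch is to pin the otherwise random ports so that a fixed set can be forwarded or opened. The property names come from the port-configuration docs linked above; the host placeholder and port numbers below are examples, not values from the original answer:

./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --conf spark.driver.host=YOUR_PUBLIC_IP --conf spark.driver.port=7001 --conf spark.fileserver.port=7002 --conf spark.blockManager.port=7003 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar

Those three ports on the driver machine then need to be reachable from the worker nodes.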
