只能从星火连接到Mesos在同一台机器上 [英] Can only connect to Mesos from Spark on the same machine

查看:182
本文介绍了只能从星火连接到Mesos在同一台机器上的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想一个Mesos集群上运行的火花。

I am trying to running Spark on a Mesos cluster.

当我运行 ./斌/火花壳--master mesos://主机:5050 从我运行Mesos掌握一切工作的机器。但是,如果我运行在不同的机器相同的命令,程序试图连接后,结束了挂:

When I run ./bin/spark-shell --master mesos://host:5050 from the machine where I run the Mesos master everything works. However if I run the same command from a different machine, process ends up hanging after trying to connect:

I0825 07:30:10.184141 27380 sched.cpp:126] Version: 0.19.0
I0825 07:30:10.187476 27385 sched.cpp:222] New master detected at master@192.168.0.241:5050
I0825 07:30:10.187619 27385 sched.cpp:230] No credentials provided. Attempting to register without authentication

在mesos主我看到下面的输出:

On the mesos master I see the following output:

[...]
I0825 15:30:23.928402 23214 master.cpp:684] Giving framework 20140825-143817-4043352256-5050-23194-0002 0ns to failover
I0825 15:30:23.929033 23210 master.cpp:2849] Framework failover timeout, removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929095 23210 master.cpp:3344] Removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929687 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):6831; disk(*):455983; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-31 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.935073 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):15001; disk(*):917264; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-29 from framework   20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938248 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):6823; disk(*):455991; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-32 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938356 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):4939; disk(*):457873; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-28 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938397 23210 hierarchical_allocator_process.hpp:362] Removed framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:27.952940 23215 http.cpp:452] HTTP request for '/master/state.json'
W0825 15:30:29.595441 23208 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-32 on slave 20140822-144404-4043352256-5050-15999-32 at slave(1)@192.168.0.233:5051 (cluster2)
W0825 15:30:29.596709 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-29 on slave 20140822-144404-4043352256-5050-15999-29 at slave(1)@192.168.0.241:5051 (cluster4)
W0825 15:30:29.615630 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-31 on slave 20140822-144404-4043352256-5050-15999-31 at slave(1)@192.168.0.213:5051 (cluster3)
W0825 15:30:29.935130 23214 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-28 on slave 20140822-144404-4043352256-5050-15999-28 at slave(1)@192.168.0.212:5051 (cluster1)

当作为从机输出

[...]
I0825 15:30:08.450343   980 slave.cpp:1337] Asked to shut down framework 20140825-143817-4043352256-5050-23194-0002 by master@192.168.0.241:5050
I0825 15:30:08.455153   980 slave.cpp:1362] Shutting down framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:08.455401   980 slave.cpp:2698] Shutting down executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456045   982 slave.cpp:2768] Killing executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456217   982 mesos_containerizer.cpp:992] Destroying container '37cc2b09-0e6d-4738-a837-7956367bba2b'
I0825 15:30:14.134845   977 mesos_containerizer.cpp:1108] Executor for container '37cc2b09-0e6d-4738-a837-7956367bba2b' has exited
I0825 15:30:14.135220   978 slave.cpp:2413] Executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002 has terminated with signal Killed
I0825 15:30:14.135356   978 slave.cpp:2552] Cleaning up executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135499   978 slave.cpp:2627] Cleaning up framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135627   976 status_update_manager.cpp:282] Closing status update streams for framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135571   975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31/runs/37cc2b09-0e6d-4738-a837-7956367bba2b' for gc 6.99999843242074days in the future
I0825 15:30:14.135910   975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31' for gc 6.99999843187556days in the future
I0825 15:30:14.135980   975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002' for gc 6.99999843111111days in the future
I0825 15:31:04.450660   978 slave.cpp:2873] Current usage 60.67%. Max allowed age: 2.053113079446458days

有没有人见过类似的东西吗?

Have anyone seen anything similar?

推荐答案

原来,这个问题并非由网络连接问题,但Mesos从回收政策所造成的这里概述:的 http://mesos.apache.org/documentation/latest/slave-recovery/

The problem turned out to be caused not by network connectivity issues but by Mesos slave recovery policy as outlined here: http://mesos.apache.org/documentation/latest/slave-recovery/

我将首先从设备连接到主,并断开它们,因为一个不相关的问题,但是当我后来又尝试过重新连接奴隶,他们被主人丢弃。引用上面链接的文档:

I would initially connect slaves to the master and disconnect them because of an unrelated problem, however when I later tried to connect the slaves again, they were being dropped by the master. To quote the documentation linked above:

重新启动的奴隶应以超时(目前,75秒)内的主重新注册。如果从时间超过此超时更长的时间来重新注册,主机关闭的奴隶,这反过来又关闭任何现场执行人/任务。因此,强烈建议重新启动自动从站的过程中(例如,使用monit的)。

A restarted slave should re-register with master within a timeout (currently, 75s). If the slave takes longer than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting a slave (e.g, using monit).

我用连接奴隶解决了这个问题的 - 严格选项设置为

I solved the problem by connecting the slaves with the --strict option set to false.

这篇关于只能从星火连接到Mesos在同一台机器上的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆