Worker node-status on a Ray EC2 cluster: update-failed


Problem description

I now have a Ray cluster running on EC2 (Ubuntu 16.04) with a c4.8xlarge master node and one identical worker. I wanted to check whether multi-threading was being used, so I ran tests timing increasing numbers (n) of the same 9-second task. Since each instance has 18 CPUs, I expected the job to take about 9 s for up to n <= 35 (assuming one CPU is used for cluster management), and then either a fault or an increase to about 18 s when switching to 36 vCPUs per node.
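For reference, here is a minimal sketch of the timing test described above (the script and its names are my reconstruction, not the poster's code); it assumes it is run on the head node of a cluster already started with ray up:

```python
import time
import ray

# Attach to the running cluster rather than starting a local Ray instance.
ray.init(address="auto")

@ray.remote(num_cpus=1)
def nine_second_task():
    """Stand-in for the real 9-second workload."""
    time.sleep(9)

def time_batch(n):
    """Submit n copies of the task and wait for all of them to finish."""
    start = time.time()
    ray.get([nine_second_task.remote() for _ in range(n)])
    return time.time() - start

for n in (9, 14, 18, 35, 36):
    print(f"n={n:3d}: {time_batch(n):.1f} s")
```

If all 35 usable CPUs across the two nodes were schedulable, every batch up to n = 35 should complete in roughly 9 s.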

Instead, the cluster handled only up to 14 tasks in parallel, after which the execution time jumped to 40 s and kept increasing with n. When I tried a c4.xlarge master (4 CPUs), the times were directly proportional to n, i.e. the tasks were running serially. So I surmise that the master actually needs its 4 CPUs for the system, and that the worker node is not being used at all. However, if I add a second worker, the times for n > 14 are about 40 s less than without it. I also tried setting target_utilization_factor to less than 1.0, but that made no difference.
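One quick way to test the theory that the worker is not being used (my suggestion, not part of the original post) is to ask Ray which resources it has actually registered; if the worker's CPUs never show up here, its setup failed before it joined the cluster:

```python
import ray

ray.init(address="auto")

# Total resources the cluster believes it has: with a healthy c4.8xlarge
# head plus one identical worker, the CPU count should be double that of
# a single node.
print(ray.cluster_resources())

# Per-node breakdown of registered resources.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"])
```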

There were no reported errors, but I did notice that the ray-node-status of the worker in the EC2 Instances console was "update-failed". Is this significant? Can anyone enlighten me about this behaviour?

Answer

The cluster did not appear to be using the worker, so the trace was showing only the 18 actual CPUs of the head node dealing with the tasks. The monitor (ray exec ray_conf.yaml 'tail -n 100 -f /tmp/ray/session_*/logs/monitor*') identified that the "update-failed" status is significant: the setup commands, called by Ray's updater.py, were failing on the worker nodes. Specifically, it was the attempt to install the build-essential C compiler package on them that, presumably, exceeded the worker memory allocation. I was only doing this to suppress a "setproctitle" installation warning, which I now understand can be safely ignored anyway.

