将SIGTERM发送给正在运行的任务,分布式分发 [英] Send SIGTERM to the running task, dask distributed

查看:92
本文介绍了将SIGTERM发送给正在运行的任务,分布式分发的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当我将一个小的Tensorflow培训作为单个任务提交时,它会启动其他线程.当我按下Ctrl+C并抬高KeyboardInterrupt时,我的任务已关闭,但基础线程未清理并继续训练.

When I submit a small Tensorflow training as a single task, it launches additional threads. When I press Ctrl+C and raise KeyboardInterrupt my task is closed but underlying threads are not cleaned up and training continues.

最初,我以为这是Tensorflow的问题(不是清洗线程),但是经过测试,我了解到问题来自Dask,可能不会再将SIGTERM信号填充到任务函数中.我的问题是,如何设置Dask以将SIGTERM信号填充到正在运行的任务?

Initially, I was thinking that this is a problem of Tensorflow (not cleaning threads), but after testing, I understand that a problem comes from the Dask side, that probably doesn't populate SIGTERM signal further to the task function. My question, how can I set Dask to populate SIGTERM signal to the running task?

所需流量示例:

本地进程->按Ctrl + C-> Dask调度程序-> Dask worker-> SIGTERM信号->使用Tensorflow训练运行单个任务.

Local process -> Press Ctrl + C -> Dask scheduler -> Dask worker -> SIGTERM signal -> Running single task with Tensorflow training.

谢谢.

P.S如果您需要其他信息,请询问.

P.S If you need additional information, just ask.

更新:

代码示例:

c = Client('<remote-scheduler>')

def task():
  # tensorflow training
  model = ...
  model.fit(x_train, y_train)

training = c.submit(task)
training.result()

现在,在训练期间,当我按Ctrl+C时,任务被取消,但tensorflow线程/进程仍然存在.

Now, during training, when I press Ctrl+C task is canceled, but tensorflow threads/processes remains.

更新2 : ps -f -u [username]命令输出.

集群集群(1个调度程序,1个工作器,同一台服务器),没有正在运行的任务:

Dask cluster (1 scheduler, 1 worker, same server), no running tasks:

UID        PID  PPID  C STIME TTY          TIME CMD
vladysl+ 16547     1  0 12:40 ?        00:00:00 /lib/systemd/systemd --user
vladysl+ 16550 16547  0 12:40 ?        00:00:00 (sd-pam)
vladysl+ 16805 16311  0 12:40 ?        00:00:00 sshd: vladyslav@pts/45
vladysl+ 16811 16805  0 12:40 pts/45   00:00:00 -bash
vladysl+ 18946 16811  4 12:41 pts/45   00:00:24 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-scheduler --port 42001
vladysl+ 22284 22175  0 12:46 ?        00:00:00 sshd: vladyslav@pts/38
vladysl+ 22285 22284  0 12:46 pts/38   00:00:00 -bash
vladysl+ 23138 16811  1 12:48 pts/45   00:00:03 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-worker localhost:42001 --worker-port 420011 --memory-limit $
vladysl+ 23143 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(11)
vladysl+ 23145 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23151 23145 99 12:48 pts/45   00:03:48 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23536 23151  0 12:49 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(25)
vladysl+ 26150 22285  0 12:51 pts/38   00:00:00 ps -f -u vladyslav

任务运行期间:

UID        PID  PPID  C STIME TTY          TIME CMD
vladysl+ 16547     1  0 12:40 ?        00:00:00 /lib/systemd/systemd --user
vladysl+ 16811 16805  0 12:40 pts/45   00:00:00 -bash
vladysl+ 18946 16811  4 12:41 pts/45   00:00:30 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-scheduler --port 42001
vladysl+ 22285 22284  0 12:46 pts/38   00:00:00 -bash
vladysl+ 23138 16811  1 12:48 pts/45   00:00:06 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-worker localhost:42001 --worker-port 420011 --memory-limit $
vladysl+ 23143 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(11)
vladysl+ 23145 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23151 23145 99 12:48 pts/45   00:07:55 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23536 23151  0 12:49 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(25)
vladysl+ 27079 22285  0 12:54 pts/38   00:00:00 ps -f -u vladyslav

Ctrl+C后,任务被取消,但tensorflow继续工作:

After pressing Ctrl+C, task canceled but tensorflow continues working:

UID        PID  PPID  C STIME TTY          TIME CMD
vladysl+ 16811 16805  0 12:40 pts/45   00:00:00 -bash
vladysl+ 18946 16811  4 12:41 pts/45   00:00:31 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-scheduler --port 42001
vladysl+ 22285 22284  0 12:46 pts/38   00:00:00 -bash
vladysl+ 23138 16811  1 12:48 pts/45   00:00:06 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-worker localhost:42001 --worker-port 420011 --memory-limit $
vladysl+ 23143 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(11)
vladysl+ 23145 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23151 23145 99 12:48 pts/45   00:09:32 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23536 23151  0 12:49 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(25)
vladysl+ 27117 22285  0 12:54 pts/38   00:00:00 ps -f -u vladyslav

如您所见,没有任何新内容出现.

As you can see nothing new appears.

推荐答案

Dask不支持从客户端传播到正在运行任务的工作人员的信号.

Dask does not support propagating signals from the client through to workers running tasks.

这篇关于将SIGTERM发送给正在运行的任务,分布式分发的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆