Condor job using DAG with some jobs needing to run on the same host
Problem description
I have a computation task which is split into several individual program executions, with dependencies. I'm using Condor 7 as the task scheduler (with the Vanilla Universe, due to constraints on the programs beyond my reach, so no checkpointing is involved), so a DAG looks like a natural solution. However, some of the programs need to run on the same host. I could not find a reference on how to do this in the Condor manuals.
Example DAG file:
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
I need to express that B and D must run on the same compute node, without breaking the parallel execution of B and C.
Thanks for the help.
Recommended answer
Condor doesn't have any simple solutions, but there is at least one kludge that should work:
Have B leave some state behind on the execute node, probably in the form of a file, that says something like MyJobRanHere = "UniqueIdentifier". Use the STARTD_CRON support to detect this and advertise it in the machine ClassAd. Have D use Requirements = MyJobRanHere == "UniqueIdentifier". As part of D's final cleanup, or perhaps in a new node E, remove the state. If you're running large numbers of jobs through, you'll probably need to clean out left-over state occasionally.
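A minimal sketch of this kludge, assuming a POSIX execute node; the attribute name, marker-file path, hook name, and identifier below are all illustrative, not prescribed by Condor:

```shell
# Hypothetical sketch -- attribute names, paths, and the identifier are illustrative.
MARKER=/var/tmp/my_job_marker

# 1. At the end of B (or in a wrapper around it), leave a marker file whose
#    contents are a ClassAd attribute assignment:
echo 'MyJobRanHere = "UniqueIdentifier"' > "$MARKER"

# 2. A STARTD_CRON hook on the execute node publishes the marker into the
#    machine ClassAd; the hook simply prints ClassAd attributes to stdout:
cat "$MARKER" 2>/dev/null

# The condor_config entries wiring up such a hook would look roughly like:
#   STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) MARKER
#   STARTD_CRON_MARKER_EXECUTABLE = /usr/local/libexec/read_marker.sh
#   STARTD_CRON_MARKER_PERIOD = 60s

# 3. D.condor then matches only the machine advertising the attribute:
#   Requirements = (MyJobRanHere == "UniqueIdentifier")

# 4. D's cleanup step (or a new DAG node E) removes the state:
#   rm -f /var/tmp/my_job_marker
```

Note the asymmetry in the ClassAd syntax: the marker file uses a single `=` (attribute assignment), while the submit file's Requirements uses `==` (comparison).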