GNU parallel --jobs option using multiple nodes on a cluster with multiple CPUs per node
Question
I am using GNU parallel to launch code on a high-performance computing (HPC) cluster that has 2 CPUs per node. The cluster uses the TORQUE Portable Batch System (PBS). My question is how the --jobs option for GNU parallel works in this scenario.
When I run a PBS script that calls GNU parallel without the --jobs option, like this:
#PBS -lnodes=2:ppn=2
...
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodisplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
it appears to use only one CPU per node, and it also produces the following error stream:
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.
This looks like one error for each node. I don't understand the first part (bash: parallel: command not found), but the second part tells me it is using one CPU per node.
When I add the option -j2 to the parallel call, the errors go away, and I think it is using two CPUs per node. I am still new to HPC, so my way of checking this is to output date-time stamps from my code (the dummy MATLAB code takes tens of seconds to complete). My questions are:
- Am I using the --jobs option correctly? Is it correct to specify -j2 because I have 2 CPUs per node? Or should I be using -jN, where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)?
- It appears that GNU parallel attempts to determine the number of CPUs per node on its own. Is there a way that I can make this work properly?
- Is there any meaning to the bash: parallel: command not found message?
Recommended answer
- Yes: -j is the number of jobs per node, so -j2 is correct for 2 CPUs per node.
- Yes: install 'parallel' in your $PATH on the remote hosts.
- Yes: it is a consequence of parallel missing from the $PATH on the remote hosts.
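Because -j counts jobs per node, the number of jobs running at once across the whole allocation is the number of distinct hosts multiplied by the -j value. The sketch below works that arithmetic out from a node file. The function name total_slots is hypothetical, and the snippet assumes the node file lists one hostname per line, possibly with repeats.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many jobs run concurrently when
# GNU parallel is given -j<jobs_per_node> and a list of hosts.
# Assumption: the node file has one hostname per line (repeats allowed).
total_slots() {
  nodefile=$1
  jobs_per_node=$2
  # Count each host once, however many times it is listed.
  hosts=$(sort -u "$nodefile" | wc -l)
  echo $(( hosts * jobs_per_node ))
}
```

For the nodes=2:ppn=2 request in the question, total_slots "$PBS_NODEFILE" 2 would report 4, which matches the four inputs 10 20 30 40 running at the same time.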
GNU Parallel logs into the remote machine and tries to determine the number of cores (using parallel --number-of-cores). That fails, so it defaults to 1 CPU core per host. If you give -j2, GNU Parallel will not try to determine the number of cores.
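One way to confirm this diagnosis is to ask each node, over a non-interactive login, whether it can find parallel at all. The following is a minimal sketch, not part of GNU Parallel: check_parallel is a hypothetical helper name, and REMOTE_SHELL is a hypothetical override (defaulting to ssh) so the loop can be dry-run without a cluster. It assumes passwordless ssh to the nodes, which --sshloginfile already requires.

```shell
#!/bin/sh
# Hypothetical helper: report whether `parallel` is on the PATH of a
# non-interactive shell on each distinct host in a node file.
# REMOTE_SHELL is an assumed override hook; it defaults to ssh.
check_parallel() {
  nodefile=$1
  sort -u "$nodefile" | while read -r host; do
    # </dev/null stops the remote command from swallowing the host list.
    if ${REMOTE_SHELL:-ssh} "$host" 'command -v parallel' >/dev/null 2>&1 </dev/null; then
      echo "$host: parallel found"
    else
      echo "$host: parallel NOT found"
    fi
  done
}
```

Inside a PBS job this could be called as check_parallel "$PBS_NODEFILE". A host reported as NOT found is a host that would also trigger the "Could not figure out number of cpus" warning.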
Did you know that you can also give the number of cores in the --sshlogin, as in 4/myserver? This is useful if you have a mix of machines with different numbers of cores.
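On a TORQUE cluster, the per-host core counts could be derived from the node file rather than typed by hand. The sketch below assumes that $PBS_NODEFILE lists each host once per allocated processor (so nodes=2:ppn=2 yields four lines), which is worth verifying on your own cluster. The function name nodefile_to_sshlogins and the output file sshlogins.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a node file with one line per allocated
# core into "cores/host" lines, e.g. "2/galles087".
# Assumption: each host appears once per processor granted to the job.
nodefile_to_sshlogins() {
  sort "$1" | uniq -c | awk '{ print $1 "/" $2 }'
}

# Possible use inside the PBS script (not run here):
#   nodefile_to_sshlogins "$PBS_NODEFILE" > sshlogins.txt
#   parallel --sshloginfile sshlogins.txt echo {} ::: 10 20 30 40
```

With the core count stated for every host, each node gets as many job slots as TORQUE granted it, even when the allocation is uneven across nodes.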