GNU parallel --jobs option using multiple nodes on a cluster with multiple CPUs per node

Question

I am using GNU parallel to launch code on a high-performance computing (HPC) cluster that has 2 CPUs per node. The cluster uses the TORQUE Portable Batch System (PBS). My question is how the --jobs option of GNU parallel works in this scenario.

When I run a PBS script calling GNU parallel without the --jobs option, like this:

#PBS -lnodes=2:ppn=2
...
parallel --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
  matlab -nodisplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40

it looks like it only uses one CPU per node, and it also produces the following error stream:

bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.

This looks like one error per node. I don't understand the first part (bash: parallel: command not found), but the second part tells me it is using one CPU per node.

When I add the option -j2 to the parallel call, the errors go away, and I think it is using two CPUs per node. I am still a newbie to HPC, so my way of checking this is to output date-time stamps from my code (the dummy MATLAB code takes tens of seconds to complete). My questions are:

  1. Am I using the --jobs option correctly? Is it correct to specify -j2 because I have 2 CPUs per node? Or should I be using -jN where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)?
  2. It appears that GNU parallel attempts to determine the number of CPUs per node on its own. Is there a way that I can make this work properly?
  3. Is there any meaning to the bash: parallel: command not found message?

Answer

  1. Yes: -j is the number of jobs per node.
  2. Yes: Install 'parallel' in your $PATH on the remote hosts.
  3. Yes: It is a consequence of parallel being missing from the $PATH on the remote hosts.

GNU Parallel logs into the remote machine and tries to determine the number of cores (using parallel --number-of-cores). That fails because parallel is not found there, so it defaults to 1 CPU core per host. If you give -j2, GNU Parallel will not try to determine the number of cores.
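
Since -j counts jobs per host, the total number of simultaneous jobs is roughly -j times the number of distinct hosts. A minimal sketch of that arithmetic follows; the node file contents are an assumption modelled on the host names in the warnings above (TORQUE typically lists each host once per allocated processor), and the variable names are hypothetical.

```shell
# Hypothetical stand-in for $PBS_NODEFILE with nodes=2:ppn=2:
# assumed to contain each host once per allocated processor.
nodefile=$(mktemp)
printf '%s\n' galles087 galles087 galles108 galles108 > "$nodefile"

# Value that would be passed to parallel as -j (jobs per host).
jobs_per_host=2

# Count distinct hosts; the arithmetic expansion strips any padding from wc.
hosts=$(( $(sort -u "$nodefile" | wc -l) ))

# Total job slots across the allocation.
total=$(( jobs_per_host * hosts ))
echo "hosts=$hosts jobs_per_host=$jobs_per_host total_slots=$total"

rm -f "$nodefile"
```

With two nodes and -j2 this gives four slots, which matches the four arguments (10 20 30 40) in the question, so all four runs could start at once.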

Did you know that you can also give the number of cores in the --sshlogin, as in 4/myserver? This is useful if you have a mix of machines with different numbers of cores.
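
As a rough sketch of how that could be applied on this cluster, the lines below turn a PBS-style node file into count/host entries. The node file contents are again an assumption based on the host names in the question, and the variable names are hypothetical; this is not something the original answer spelled out.

```shell
# Hypothetical stand-in for $PBS_NODEFILE (one line per allocated processor).
nodefile=$(mktemp)
printf '%s\n' galles087 galles087 galles108 galles108 > "$nodefile"

# Count how often each host appears and emit "count/host" lines,
# the same form as the 4/myserver example.
sshlogins=$(sort "$nodefile" | uniq -c | awk '{ print $1 "/" $2 }')
printf '%s\n' "$sshlogins"

rm -f "$nodefile"
```

The resulting lines could be saved to a file and passed with --sshloginfile in place of $PBS_NODEFILE, so that the per-host count comes from the PBS allocation rather than from a remote probe. This assumes the file accepts the same count/host form as --sshlogin; check the man page of the installed version.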
