Slurm: Use cores from multiple nodes for R parallelization


Problem description

I want to parallelize an R script on an HPC with a Slurm scheduler.

Slurm is configured with SelectType: CR_Core_Memory.

Each compute node has 16 cores (32 threads).

I pass the R script to SLURM with the following configuration, using clustermq as the interface to Slurm.

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=normal
#SBATCH --output={{ log_file | /dev/null }} # you can add .%a for array index
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 2048 }}
#SBATCH --cpus-per-task={{ n_cpus }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --ntasks={{ n_tasks }}
#SBATCH --nodes={{ n_nodes }}

#ulimit -v $(( 1024 * {{ memory | 4096 }} ))
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
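
For reference, this is roughly how such a template might be driven from the R side with clustermq. The scheduler/template options and Q()'s template argument are clustermq features, but the file name, the concrete values, and the placeholder workload below are assumptions, since the actual R script is not shown.

# Sketch only: assumes the template above is saved as "slurm.tmpl" and that
# clustermq is installed on the cluster's login node.
options(
  clustermq.scheduler = "slurm",
  clustermq.template  = "slurm.tmpl"
)

library(clustermq)

# {{ n_jobs }} and {{ memory }} are filled from Q()'s own arguments; custom
# placeholders such as {{ n_cpus }}, {{ n_tasks }} and {{ n_nodes }} are
# supplied via the template list.
res <- Q(
  function(x) x^2,                  # placeholder workload
  x        = 1:100,
  n_jobs   = 1,
  memory   = 2048,
  template = list(n_cpus = 16, n_tasks = 1, n_nodes = 1)
)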

Within the R script I do "multicore" parallelization with 30 cores. I would like to use cores from multiple nodes to satisfy the requirement of 30 CPUs, i.e. 16 cores from node1 and 14 cores from node2.
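
The script itself is not shown; as a minimal sketch (the function heavy_task below is hypothetical), the "multicore" step presumably resembles the following. This is the crux of the problem: mclapply forks workers from the one R process, so all 30 of them must sit on the node where that process runs.

library(parallel)

heavy_task <- function(i) {   # hypothetical per-item workload
  i^2
}

# Fork-based multicore parallelism: all 30 workers are forked from this one
# R process and therefore run on the same node as this process.
res <- mclapply(1:1000, heavy_task, mc.cores = 30)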

I tried using n_tasks = 2 and cpus-per-task=16. With this, the job gets assigned to two nodes. However, only one node is doing computation (on 16 cores). The second node is assigned to the job but does nothing.

In this question, srun is used to split parallelism across nodes with foreach and Slurm IDs. I use neither srun nor foreach. Is there a way to achieve what I want with SBATCH and multicore parallelism?

(I know that I could use SelectType=CR_CPU_Memory and have 32 threads available per node. However, the question is how to use cores/threads from multiple nodes in general to be able to scale up parallelism).

Recommended answer

Summary of my comments:

The answer is you cannot do this because your task is using a bunch of CPUs from within a single R process. You're asking a single R process to parallelize a task across more CPUs than the physical machine has. You cannot split a single R process across multiple nodes. Those nodes do not share memory, so you can't combine CPUs from different nodes, at least not with typical cluster architecture. It's possible if you had a distributed operating system like DCOS.

In your case, the solution is to split your job up outside of those R processes. Run 2 (or 3, or 4) separate R processes, each on its own node, and then restrict each R process to the maximum number of CPUs the machine has.
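
As a minimal sketch of that restructuring with clustermq (the workload and chunking below are assumptions): submit two array tasks, let each worker land on its own node, and cap the inner multicore level at a single node's 16 cores.

library(clustermq)

# Each worker processes one chunk of the input; the inner mclapply stays
# within the 16 cores of the node the worker runs on.
process_chunk <- function(chunk) {
  heavy_task <- function(i) i^2                 # hypothetical workload
  parallel::mclapply(chunk, heavy_task, mc.cores = 16)
}

chunks <- split(1:1000, cut(1:1000, 2))          # two chunks for two workers

# Two array tasks => two independent R processes, each requesting 16 CPUs
# via the {{ n_cpus }} placeholder, so Slurm can place them on separate nodes.
res <- Q(
  process_chunk,
  chunk    = chunks,
  n_jobs   = 2,
  template = list(n_cpus = 16, n_tasks = 1, n_nodes = 1)
)

Each worker then scales only to what one node offers, and the total parallelism (2 × 16 cores) comes from running more workers rather than from widening a single R process.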

