How to access GPUs on different nodes in a cluster with Slurm?


Problem description

I have access to a cluster that's run by Slurm, in which each node has 4 GPUs.

I have a code that needs 8 GPUs.

So the question is: how do I request 8 GPUs on a cluster in which each node has only 4 GPUs?

So this is the job that I tried to submit via sbatch:

#!/bin/bash
#SBATCH --gres=gpu:8              
#SBATCH --nodes=2               
#SBATCH --mem=16000M              
#SBATCH --time=0-01:00     

But then I get the following error:

sbatch: error: Batch job submission failed: Requested node configuration is not available    

Then I changed the settings to this and submitted again:

#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
nvidia-smi    # runs on the first allocated node only

and the result shows only 4 GPUs, not 8.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 0000:03:00.0     Off |                    0 |
| N/A   32C    P0    31W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 0000:04:00.0     Off |                    0 |
| N/A   37C    P0    29W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 0000:82:00.0     Off |                    0 |
| N/A   35C    P0    28W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 0000:83:00.0     Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 12193MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks.

Answer

Slurm does not support what you need. It can only assign GPUs to your job per node, not per cluster. So, unlike CPUs and other consumable resources, GPUs are not a cluster-wide consumable: they are bound to the node that hosts them.
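In practice that means the job has to span two nodes with 4 GPUs requested on each, and the code itself must coordinate across the nodes (for example with MPI or a distributed deep-learning framework). It also explains the second attempt above: the allocation does contain 8 GPUs, but the batch script, and therefore nvidia-smi, runs only on the first allocated node, so only that node's 4 GPUs are listed. Below is a minimal sketch of such a submission, assuming default cluster settings; your_distributed_program is a hypothetical placeholder for an application that can actually use GPUs on more than one node:

#!/bin/bash
#SBATCH --nodes=2                  # two nodes, each with 4 GPUs
#SBATCH --gres=gpu:4               # --gres is per node, so the job gets 8 GPUs in total
#SBATCH --ntasks-per-node=1        # one task per node; adjust to your launcher
#SBATCH --mem=16000M
#SBATCH --time=0-01:00

# Launch nvidia-smi once per node: each task lists the GPUs of its own node
srun --ntasks-per-node=1 nvidia-smi -L

# Hypothetical multi-node program; it must handle the cross-node
# communication itself (e.g. via MPI)
srun your_distributed_program

Either way, the 8 GPUs are never exposed as a single pool: each process sees at most the 4 GPUs of its own node, and coordinating beyond that is the application's job.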

If you are interested in this topic, there is a research effort to turn GPUs into consumable resources; check this paper, where you'll find how to do it using GPU virtualization technologies.
