How to use GPUDirect RDMA with InfiniBand


Question

I have two machines. Each machine has multiple Tesla cards and one InfiniBand card. I want GPUs on different machines to communicate over InfiniBand; point-to-point unicast is enough. I want to use GPUDirect RDMA so that I can avoid extra copy operations.

I am aware that Mellanox now provides a driver for its InfiniBand cards, but it does not come with a detailed development guide. I am also aware that OpenMPI supports the feature I am asking about, but OpenMPI is too heavyweight for this simple task and it does not support multiple GPUs in a single process.

Can anyone help me use the driver directly to do the communication? A code sample, a tutorial, anything would be good. I would also appreciate help finding the code in OpenMPI that deals with this.

Recommended answer

For GPUDirect RDMA to work, you need the following installed:

A recent NVIDIA CUDA suite

All of the above should be installed in the order listed, and the relevant kernel modules loaded. After that, you should be able to register memory allocated in GPU video memory for RDMA transactions. Sample code looks like this:

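#include <cuda_runtime.h>
#include <infiniband/verbs.h>

void *gpu_buffer;
struct ibv_mr *mr;
const size_t size = 64 * 1024;
cudaMalloc(&gpu_buffer, size); // TODO: check the returned cudaError_t
// pd is a protection domain obtained earlier from ibv_alloc_pd(); mr is NULL on failure
mr = ibv_reg_mr(pd, gpu_buffer, size,
                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);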

On a GPUDirect RDMA enabled system, this creates a memory region with a valid memory key that you can use for RDMA transactions with our HCA.
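The answer does not say how the peer learns about the registered region. For an RDMA read or write, the remote side needs the buffer address and the region's rkey, and a common approach is to send them out of band, for example over an ordinary TCP socket. The following is a minimal sketch of packing those values into a fixed, byte-order-independent record. The struct mr_info, its 16-byte layout, and the function names are assumptions made for illustration; they are not part of libibverbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record describing a registered region; not a libibverbs type. */
struct mr_info {
    uint64_t addr;   /* (uintptr_t)mr->addr on the registering side */
    uint32_t rkey;   /* mr->rkey */
    uint32_t length; /* size of the registered region in bytes */
};

enum { MR_INFO_WIRE_SIZE = 16 };

/* Store the low nbytes of v at dst, most significant byte first. */
static void put_be(unsigned char *dst, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = (unsigned char)(v >> (8 * (nbytes - 1 - i)));
}

/* Read nbytes from src as a big-endian unsigned integer. */
static uint64_t get_be(const unsigned char *src, size_t nbytes)
{
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | src[i];
    return v;
}

/* Serialize info into a 16-byte buffer: addr (8), rkey (4), length (4). */
void mr_info_pack(const struct mr_info *info,
                  unsigned char wire[MR_INFO_WIRE_SIZE])
{
    assert(info != NULL && wire != NULL);
    put_be(wire, info->addr, 8);
    put_be(wire + 8, info->rkey, 4);
    put_be(wire + 12, info->length, 4);
}

/* Inverse of mr_info_pack, run on the peer after receiving the 16 bytes. */
void mr_info_unpack(const unsigned char wire[MR_INFO_WIRE_SIZE],
                    struct mr_info *info)
{
    assert(info != NULL && wire != NULL);
    info->addr = get_be(wire, 8);
    info->rkey = (uint32_t)get_be(wire + 8, 4);
    info->length = (uint32_t)get_be(wire + 12, 4);
}
```

On the receiving side, the unpacked addr and rkey would typically go into the wr.rdma.remote_addr and wr.rdma.rkey fields of a struct ibv_send_wr posted with opcode IBV_WR_RDMA_WRITE or IBV_WR_RDMA_READ. Packing byte by byte avoids depending on struct padding or on both hosts sharing the same endianness.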

For more details about using RDMA and InfiniBand verbs in your code, you can refer to this document.
