What is the difference between cudaMemcpy() and cudaMemcpyPeer() for P2P-copy?



I want to copy data from GPU0-DDR to GPU1-DDR directly without CPU-RAM.

As stated on page 15 of http://people.maths.ox.ac.uk/gilesm/cuda/MultiGPU_Programming.pdf:

Peer-to-Peer Memcpy
 Direct copy from pointer on GPU A to pointer on GPU B

 With UVA, just use cudaMemcpy(…, cudaMemcpyDefault)
     Or cudaMemcpyAsync(…, cudaMemcpyDefault)

 Also non-UVA explicit P2P copies:
     cudaError_t cudaMemcpyPeer( void * dst, int dstDevice, const void* src, 
        int srcDevice, size_t count )
     cudaError_t cudaMemcpyPeerAsync( void * dst, int dstDevice,
        const void* src, int srcDevice, size_t count, cudaStream_t stream = 0 )

  1. If I use cudaMemcpy(), do I first have to set the flag cudaSetDeviceFlags( cudaDeviceMapHost )?
  2. Do I have to call cudaMemcpy() with pointers obtained from the function cudaHostGetDevicePointer(&uva_ptr, ptr, 0)?
  3. Does cudaMemcpyPeer() have any advantages? If not, why is it needed?

Solution

Unified Virtual Addressing (UVA) provides a single address space for all CPU and GPU memories, since it allows the physical location of an allocation to be determined from the pointer value alone.
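One way to see this in action is cudaPointerGetAttributes, which reports which device (or the host) a pointer belongs to. A minimal sketch, assuming a CUDA-capable device 0 is present:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float *d_buf;
    cudaSetDevice(0);
    cudaMalloc(&d_buf, 1024 * sizeof(float));

    // Under UVA a pointer's value encodes its location, so the
    // runtime can tell us where the allocation lives:
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_buf);
    printf("pointer resides on device %d\n", attr.device);

    cudaFree(d_buf);
    return 0;
}
```

This per-pointer bookkeeping is exactly what lets cudaMemcpy with cudaMemcpyDefault infer the copy direction.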

Peer-to-peer memcpy with UVA

When UVA is available, cudaMemcpy can be used for peer-to-peer memcpy, since CUDA can infer which device "owns" which memory. The instructions typically needed to perform a peer-to-peer memcpy with UVA are the following:

//Check for peer access between participating GPUs:
int can_access_peer_0_1, can_access_peer_1_0;
cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid_1, gpuid_0);

//Enable peer access between participating GPUs
//(each call must be issued while the "local" device is current):
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);

//UVA memory copy (direction inferred from the pointer values):
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);

Peer-to-peer memcpy without UVA

When UVA is not possible, peer-to-peer memcpy is done via cudaMemcpyPeer. Here is an example:

// Set device 0 as current
cudaSetDevice(0); 
float* p0;
size_t size = 1024 * sizeof(float);
// Allocate memory on device 0
cudaMalloc(&p0, size); 
// Set device 1 as current
cudaSetDevice(1); 
float* p1;
// Allocate memory on device 1
cudaMalloc(&p1, size); 
// Set device 0 as current
cudaSetDevice(0);
// Launch kernel on device 0
MyKernel<<<1000, 128>>>(p0); 
// Set device 1 as current
cudaSetDevice(1); 
// Copy p0 to p1
cudaMemcpyPeer(p1, 1, p0, 0, size); 
// Launch kernel on device 1
MyKernel<<<1000, 128>>>(p1);

As you can see, in the former case (UVA possible) you do not need to specify which device each pointer refers to, while in the latter case (UVA not possible) you must state explicitly which device each pointer belongs to.
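The asynchronous variant quoted from the slides works the same way, with the copy enqueued on a stream so it can overlap with other work. A hedged sketch, assuming two GPUs (devices 0 and 1) are installed:

```cuda
#include <cuda_runtime.h>

int main() {
    size_t size = 1024 * sizeof(float);
    float *p0, *p1;
    cudaSetDevice(0);
    cudaMalloc(&p0, size);   // buffer on device 0
    cudaSetDevice(1);
    cudaMalloc(&p1, size);   // buffer on device 1

    // Same (dst, dstDevice, src, srcDevice, count) arguments as
    // cudaMemcpyPeer, plus a stream; the call returns immediately
    // and the copy completes asynchronously on that stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(p1);
    cudaSetDevice(0);
    cudaFree(p0);
    return 0;
}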

The instruction

cudaSetDeviceFlags(cudaDeviceMapHost);

is used to enable the mapping of page-locked host memory into the device address space. That is a different matter: it concerns host<->device memory movements, not the peer-to-peer memory movements that are the topic of your post.
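For contrast, this is the scenario where cudaSetDeviceFlags(cudaDeviceMapHost) and cudaHostGetDevicePointer do matter: zero-copy access to pinned host memory from a kernel, with no GPU-to-GPU transfer involved. A minimal sketch, assuming a device that supports mapped host memory:

```cuda
#include <cuda_runtime.h>

int main() {
    // Must be set before the CUDA context is created on the device:
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate page-locked host memory that is mapped into the
    // device address space:
    float *h_ptr, *d_ptr;
    size_t size = 1024 * sizeof(float);
    cudaHostAlloc(&h_ptr, size, cudaHostAllocMapped);

    // Obtain the device-side alias of the host buffer -- this is the
    // pointer a kernel would use to read/write host memory directly:
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);

    // ... launch kernels taking d_ptr; no explicit memcpy needed ...

    cudaFreeHost(h_ptr);
    return 0;
}
```

Since your goal is GPU0-to-GPU1 copies, none of this machinery applies.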

In conclusion, the answers to your questions are:

  1. NO;
  2. NO;
  3. When possible, enable UVA and use cudaMemcpy (you don't need to specify the devices); otherwise, use cudaMemcpyPeer (and you need to specify the devices).

