Why are NVIDIA Pascal GPUs slow on running CUDA kernels when using cudaMallocManaged?


Question

I was testing the new CUDA 8 along with the Pascal Titan X GPU, and was expecting a speedup for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04.

Here is the minimum code that can reproduce the result:

CUDASample.cuh

class CUDASample{
 public:
  void AddOneToVector(std::vector<int> &in);
};

CUDASample.cu

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMallocManaged(reinterpret_cast<void **>(&data),
                    in.size() * sizeof(int),
                    cudaMemAttachGlobal);

  for (std::size_t i = 0; i < in.size(); i++){
    data[i] = in.at(i);
  }

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  for (std::size_t i = 0; i < in.size(); i++){
    in.at(i) = data[i];
  }

  cudaFree(data);
}

Main.cpp

std::vector<int> v;

for (int i = 0; i < 8192000; i++){
  v.push_back(i);
}

CUDASample cudasample;

cudasample.AddOneToVector(v);

The only difference is the NVCC flag, which for the Pascal Titan X is:

-gencode arch=compute_61,code=sm_61 -std=c++11

and for the old Maxwell Titan X is:

-gencode arch=compute_52,code=sm_52 -std=c++11

EDIT: Here are the results for running NVIDIA Visual Profiling.

For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us.

For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong.

I further isolated the problem by replacing cudaMallocManaged with good old cudaMalloc, did some profiling, and observed some interesting results.

CUDASample.cu

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int));
  cudaMemcpy(reinterpret_cast<void*>(data),reinterpret_cast<void*>(in.data()), 
             in.size() * sizeof(int), cudaMemcpyHostToDevice);

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  cudaMemcpy(reinterpret_cast<void*>(in.data()),reinterpret_cast<void*>(data), 
             in.size() * sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(data);
}

For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us.

For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see...

Why is the Pascal GPU so slow at running CUDA kernels when cudaMallocManaged is used? It would be a travesty if I have to revert all my existing code that uses cudaMallocManaged to cudaMalloc. This experiment also shows that the memory transfer time using cudaMallocManaged is a lot slower than using cudaMalloc, which also feels like something is wrong. If using it results in a slow run time even though the code is easier, that should be unacceptable, because the whole purpose of using CUDA instead of plain C++ is to speed things up. What am I doing wrong, and why am I observing this kind of result?

Solution

Under CUDA 8 with Pascal GPUs, managed memory data migration under a unified memory (UM) regime will generally occur differently than on previous architectures, and you are experiencing the effects of this.

With previous architectures (e.g. Maxwell), managed allocations used by a particular kernel call will be migrated all at once, upon launch of the kernel, approximately as if you called cudaMemcpy to move the data yourself.

With CUDA 8 and Pascal GPUs, data migration occurs via demand-paging. At kernel launch, by default, no data is explicitly migrated to the device. When the GPU device code attempts to access data in a particular page that is not resident in GPU memory, a page fault will occur. The net effect of this page fault is to:

  1. Cause the GPU kernel code (the thread or threads that accessed the page) to stall (until step 2 is complete)
  2. Cause that page of memory to be migrated from the CPU to the GPU

This process will be repeated as necessary, as GPU code touches various pages of data. The sequence of operations involved in step 2 above involves some latency as the page fault is processed, in addition to the time spent actually moving the data. Since this process moves data a page at a time, it may be significantly less efficient than moving all the data at once, either using cudaMemcpy or via the pre-Pascal UM arrangement that caused all data to be moved at kernel launch (whether it was needed or not, and regardless of when the kernel code actually needed it).

Both approaches have their pros and cons, and I don't wish to debate the merits or various opinions or viewpoints. The demand-paging process enables a great many important features and capabilities for Pascal GPUs.

This particular code example, however, does not benefit. This was anticipated, and so the recommended way to bring the behavior in line with previous (e.g. Maxwell) behavior/performance is to precede the kernel launch with a cudaMemPrefetchAsync() call.

You would use the CUDA stream semantics to force this call to complete prior to the kernel launch (if the kernel launch does not specify a stream, you can pass NULL for the stream parameter, to select the default stream). I believe the other parameters for this function call are pretty self-explanatory.

With this function call before your kernel call, covering the data in question, you should not observe any page-faulting in the Pascal case, and the profile behavior should be similar to the Maxwell case.
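For illustration, here is a sketch of how that prefetch might be inserted into the question's AddOneToVector (device ID 0 and the default stream are assumptions on my part; error checking is omitted):

```cuda
void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMallocManaged(reinterpret_cast<void **>(&data),
                    in.size() * sizeof(int),
                    cudaMemAttachGlobal);

  for (std::size_t i = 0; i < in.size(); i++){
    data[i] = in.at(i);
  }

  // Migrate the whole managed allocation to the GPU (device 0 assumed here)
  // before the kernel runs. Issued on the default (NULL) stream, so it is
  // ordered before the kernel launch below, which also uses that stream.
  cudaMemPrefetchAsync(data, in.size() * sizeof(int), 0, NULL);

  dim3 blks(in.size()/(16*32), 1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  for (std::size_t i = 0; i < in.size(); i++){
    in.at(i) = data[i];
  }

  cudaFree(data);
}
```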

As I mentioned in the comments, if you had created a test case that involved two kernel calls in sequence, you would have observed that the 2nd call runs at approximately full speed even in the Pascal case, since all of the data has already been migrated to the GPU side through the first kernel execution. Therefore, the use of this prefetch function should not be considered mandatory or automatic, but should be used thoughtfully. There are situations where the GPU may be able to hide the latency of page-faulting to some degree, and obviously data already resident on the GPU does not need to be prefetched.
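A minimal sketch of that two-kernel experiment, reusing the kernel, pointer, and launch configuration from the question (observing the per-launch timings in the profiler is assumed):

```cuda
// First launch: pages fault in on demand as the kernel touches them,
// so this launch carries the demand-paging cost.
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();

// Second launch: all touched pages are now resident on the GPU,
// so this launch should run at approximately full speed,
// even without any explicit prefetch.
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();
```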
