Does CudaMallocManaged allocate memory on the device?

Question

I'm using Unified Memory to simplify access to data on the CPU and GPU. As far as I know, cudaMallocManaged should allocate memory on the device. I wrote a simple program to check:

#define TYPE float
#define BDIMX 16
#define BDIMY 16
#include <cuda.h>
#include <cstdio>
#include <cmath>
#include <iostream>

__global__ void kernel(TYPE *g_output, TYPE *g_input, const int dimx, const int dimy)
{
  __shared__ TYPE s_data[BDIMY][BDIMX];
  int ix = blockIdx.x * blockDim.x + threadIdx.x;
  int iy = blockIdx.y * blockDim.y + threadIdx.y;
  int in_idx = iy * dimx + ix; // index for reading input
  int tx = threadIdx.x; // thread's x-index into the corresponding shared memory tile
  int ty = threadIdx.y; // thread's y-index into the corresponding shared memory tile
  s_data[ty][tx] = g_input[in_idx];
  __syncthreads();
  g_output[in_idx] = s_data[ty][tx] * 1.3;
}

int main(){
  int size_x = 16, size_y = 16;
  dim3 numTB;
  numTB.x = (int)ceil((double)(size_x)/(double)BDIMX);
  numTB.y = (int)ceil((double)(size_y)/(double)BDIMY);
  dim3 tbSize;
  tbSize.x = BDIMX;
  tbSize.y = BDIMY;
  float* a, * a_out;
  cudaMallocManaged((void**)&a,     size_x * size_y * sizeof(TYPE));
  cudaMallocManaged((void**)&a_out, size_x * size_y * sizeof(TYPE));

  kernel<<<numTB, tbSize>>>(a_out, a, size_x, size_y);
  cudaDeviceSynchronize();
  return 0;
}
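
For reference, the question doesn't include a build command. Since the profile below was taken on a Tesla K40c (compute capability 3.5), a plausible way to compile it would be the following (the source file name here is made up):

$ nvcc -arch=sm_35 managed_test.cu -o managed_test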

I'm not even accessing the data on the CPU, so as to avoid any page faults, so the memory should supposedly reside in device memory. However, when I run nvprof on this code, I get the following results:

  invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K40c (0)"
Kernel: kernel(float*, float*, int, int)
        1                   local_load_transactions                   Local Load Transactions           0           0           0
        1                  local_store_transactions                  Local Store Transactions           0           0           0
        1                  shared_load_transactions                  Shared Load Transactions           8           8           8
        1                 shared_store_transactions                 Shared Store Transactions           8           8           8
        1                          gld_transactions                  Global Load Transactions           8           8           8
        1                          gst_transactions                 Global Store Transactions           8           8           8
        1                  sysmem_read_transactions           System Memory Read Transactions          32          32          32
        1                 sysmem_write_transactions          System Memory Write Transactions          34          34          34
        1                    tex_cache_transactions                Texture Cache Transactions           0           0           0
        1                    dram_read_transactions           Device Memory Read Transactions           0           0           0
        1                   dram_write_transactions          Device Memory Write Transactions           0           0           0
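
(For reference, these counters can be collected with nvprof's --metrics option. The exact command isn't shown here, but an invocation along these lines would produce the memory-related rows:)

$ nvprof --metrics gld_transactions,gst_transactions,sysmem_read_transactions,sysmem_write_transactions,dram_read_transactions,dram_write_transactions ./a.out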

So apparently the array is allocated in system memory, not device memory. What am I missing here?

Answer

Managed memory really does allocate physical memory on the GPU. You can confirm this yourself by doing something like the following to your code:

#include <iostream>

// Report the GPU's free and total physical memory as seen by the driver
void report_gpu_mem()
{
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    std::cout << "Free = " << free << " Total = " << total << std::endl;
}

int main()
{
    float* a, * a_out;
    size_t sz = 1 << 24; // 16 MB
    report_gpu_mem();
    cudaMallocManaged((void**)&a, sz);     // first managed allocation
    report_gpu_mem();
    cudaMallocManaged((void**)&a_out, sz); // second managed allocation
    report_gpu_mem();
    cudaFree(a);
    report_gpu_mem();
    cudaFree(a_out);
    report_gpu_mem();
    return cudaDeviceReset();
}

This allocates 16MB for each of two managed allocations and then frees them. No host or device access occurs, so there should be no triggered transfers or synchronisation. The size is large enough to exceed the minimum granularity of the GPU memory manager and trigger visible changes in the amount of free memory. Compiling and running it does this:

$ nvcc -arch=sm_52 sleepy.cu 
$ CUDA_VISIBLE_DEVICES="0" ./a.out 
Free = 4211929088 Total = 4294770688
Free = 4194869248 Total = 4294770688
Free = 4178092032 Total = 4294770688
Free = 4194869248 Total = 4294770688
Free = 4211654656 Total = 4294770688

The physical free memory on the GPU is clearly being decremented and incremented by 16MB at each alloc/free (for example, 4194869248 − 4178092032 = 16777216 bytes, exactly the 1 << 24 requested).
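
As an additional check (not part of the original answer), recent CUDA toolkits can report where an allocation lives through cudaPointerGetAttributes. A minimal sketch, assuming CUDA 10 or later, where cudaPointerAttributes exposes a type field:

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    float *a;
    cudaMallocManaged((void**)&a, 1 << 24);

    // For a cudaMallocManaged allocation, the driver should report
    // the memory type as cudaMemoryTypeManaged.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, a);
    std::cout << "managed? " << (attr.type == cudaMemoryTypeManaged) << std::endl;

    cudaFree(a);
    return 0;
}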
