Using CUDA Profiler nvprof for memory accesses

Question

I'm using nvprof to count the global memory accesses made by the following CUDA code. The kernel performs 36 loads (reading the d_In array) and 36+36 stores (writing the d_Out and d_rows arrays), so I expect 36 global memory loads and 72 global memory stores in total. However, when I profile the code with nvprof, it reports the numbers below. (Ultimately, I want to compute the Compute to Global Memory Access (CGMA) ratio.)

      1                gld_transactions        Global Load Transactions           6           6           6
      1                gst_transactions       Global Store Transactions          11          11          11
      1            l2_read_transactions            L2 Read Transactions         133         133         133
      1           l2_write_transactions           L2 Write Transactions          24          24          24


#include <stdio.h>
#include "cuda_profiler_api.h"
__constant__ int crows;

__global__ void kernel(double *d_In, double *d_Out, int *d_rows){
        int tx=threadIdx.x;
        int bx=blockIdx.x;
        int n=bx*blockDim.x+tx;
        if(n < 36){
                d_Out[n]=d_In[n]+1;
                d_rows[n]=crows;
        }
        return;
}

int main(int argc,char **argv){

     double I[36]={1,5,9,2,6,10,3,7,11,4,8,12,13,17,21,14,18,22,15,19,23,16,20,24,25,29,33,26,30,34,27,31,35,28,32,36};

     double *d_In;
     double *d_Out;
     int *d_rows;

     double Iout[36];
     int rows=5;
     int h_rows[36];

     cudaMemcpyToSymbol(crows,&rows,sizeof(int));
     cudaMalloc(&d_In,sizeof(double)*36);
     cudaMalloc(&d_Out,sizeof(double)*36);
     cudaMalloc(&d_rows,sizeof(int)*36);

     cudaMemcpy(d_In,I,sizeof(double)*36,cudaMemcpyHostToDevice);

     dim3 dimGrid(4,1,1);
     dim3 dimBlock(10,1,1);

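     cudaProfilerStart();
     kernel<<<dimGrid,dimBlock>>>(d_In,d_Out,d_rows);
     cudaDeviceSynchronize();   // kernel launches are asynchronous; wait before stopping the profiler
     cudaProfilerStop();

     cudaMemcpy(Iout,d_Out,sizeof(double)*36,cudaMemcpyDeviceToHost);
     cudaMemcpy(h_rows,d_rows,sizeof(int)*36,cudaMemcpyDeviceToHost);

     int i;
     for(i=0;i<36;i++)
       printf("%f %d\n",Iout[i],h_rows[i]);

     // release device memory
     cudaFree(d_In);
     cudaFree(d_Out);
     cudaFree(d_rows);
     return 0;
}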

Can someone help me? Thank you

Recommended answer

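It's customary to ask a question, something more specific than "Can someone help me?" As shown, your kernel performs at most one floating point operation per active thread (the +1 in d_In[n]+1) against one global load and two global stores per thread, so the CGMA ratio is small however you count it.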

Regarding the memory transactions, your code has 4 threadblocks:

 dim3 dimGrid(4,1,1);

Each threadblock may run on a separate multiprocessor. You have 10 threads in each block. The following line of code:

            d_Out[n]=d_In[n]+1;

will generate at least one global load transaction (d_In) and one global store transaction (d_Out) to service the threads. The fourth block will have threads whose global indices (n) for the active threads will be 30-35. When this block executes the above line of code, it will generate two global load and two global store transactions, because the threads require two cachelines to service their requests. So this one line of code may generate 5 global load transactions and 5 global store transactions.

For similar reasons, the next line of code:

            d_rows[n]=crows;

may generate 5 additional global store transactions. So of your profiler output:

  1                gld_transactions        Global Load Transactions           6           6           6
  1                gst_transactions       Global Store Transactions          11 

I believe I have explained 5 of the 6 global load transactions, and 10 of the 11 global store transactions. Hopefully that is enough to give you an idea of the origin of these numbers.
