nvprof events "fb_subp0_read_sectors" and "fb_subp1_read_sectors" do not report correct results


Problem Description



I tried to count the number of DRAM (global memory) accesses for a simple vector add kernel.

__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    // Each block owns a contiguous chunk of blockDim.x * N elements.
    // Each thread strides through that chunk with a stride of blockDim.x,
    // so consecutive threads touch consecutive addresses (coalesced access).
    int blockStartIndex  = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex   = threadStartIndex + ( N * blockDim.x );
    int i;

    for( i=threadStartIndex; i<threadEndIndex; i+=blockDim.x ){
        C[i] = A[i] + B[i];
    }
}

Grid size = 180
Block size = 128

Size of each array = 180 * 128 * N floats, where N is an input parameter (elements per thread).

When N = 1, size of each array = 180 * 128 * 1 floats = 90 KB.
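
For context, a minimal host-side harness matching that configuration might look like the sketch below. This is my own reconstruction, not code from the original post: the buffer names, the initialization values, and the exact build/profile commands are assumptions, and it assumes the AddVectors kernel above sits in the same file.

// Hypothetical harness, not from the original post.
// Build and profile, for example, with:
//   nvcc -Xptxas -dlcm=cg vecadd.cu -o vecadd
//   nvprof --events fb_subp0_read_sectors,fb_subp1_read_sectors,fb_subp0_write_sectors,fb_subp1_write_sectors ./vecadd
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const int gridSize  = 180;
    const int blockSize = 128;
    const int N         = 1;   // elements per thread
    const size_t count  = (size_t)gridSize * blockSize * N;
    const size_t bytes  = count * sizeof(float);

    float* hA = (float*)malloc(bytes);
    float* hB = (float*)malloc(bytes);
    for (size_t i = 0; i < count; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // These two copies matter for the question below: they leave A and B
    // resident in the L2 cache right before the kernel runs.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    AddVectors<<<gridSize, blockSize>>>(dA, dB, dC, N);
    cudaDeviceSynchronize();

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB);
    return 0;
}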

Accesses to all three arrays A, B and C should go to DRAM.

Therefore theoretically,

DRAM writes (C)    = 2880 (32-byte accesses)
DRAM reads (A, B)  = 2880 + 2880 = 5760 (32-byte accesses)
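
To spell out the arithmetic behind those numbers (my own restatement of the question's figures, no new data), here is a tiny standalone check:

#include <cstdio>

int main()
{
    const long elems   = 180L * 128 * 1;   // gridSize * blockSize * N
    const long bytes   = elems * 4;        // sizeof(float) = 4
    const long sectors = bytes / 32;       // one DRAM sector = 32 bytes
    printf("sectors per array      : %ld\n", sectors);      // 2880
    printf("expected reads (A + B) : %ld\n", 2 * sectors);  // 5760
    printf("expected writes (C)    : %ld\n", sectors);      // 2880
    return 0;
}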

But when I used nvprof, I got:

DRAM writes = fb_subp0_write_sectors + fb_subp1_write_sectors = 1440 + 1440 = 2880 (32-byte accesses)
DRAM reads  = fb_subp0_read_sectors + fb_subp1_read_sectors = 23 + 7 = 30 (32-byte accesses)

Now this is the problem. Theoretically there should be 5760 DRAM reads, but nvprof reports only 30, which looks impossible to me. Furthermore, if I double the size of the vectors (N = 2), the reported DRAM reads still stay at 30.

It would be great if someone could shed some light on this.

I have disabled the L1 cache by using the compiler option "-Xptxas -dlcm=cg".

Thanks, Waruna

Solution

If you have done a cudaMemcpy before the kernel launch to copy the source buffers from host to device, the copy leaves the source buffers resident in the L2 cache. The kernel's reads then hit in L2 instead of missing to DRAM, which is why you see such a small value for fb_subp0_read_sectors + fb_subp1_read_sectors.

If you comment out the cudaMemcpy calls before the kernel launch, you will see that the event values of fb_subp0_read_sectors and fb_subp1_read_sectors include the values you are expecting.
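
A sketch of that experiment, written as a variant of the hypothetical harness in the question (only the copies change; everything else stays the same):

// Disable the host-to-device copies. The answer only says to comment
// them out; #if 0 is an equivalent way to do it:
#if 0
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
#endif

// dA and dB now hold uninitialized device memory, so C is meaningless,
// but the kernel's loads of A and B can no longer hit in L2, and
// fb_subp0_read_sectors + fb_subp1_read_sectors should rise to roughly
// the expected 5760.
AddVectors<<<gridSize, blockSize>>>(dA, dB, dC, N);
cudaDeviceSynchronize();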
