CUDA global memory access speed


Problem description

Here is some simple CUDA code.
I am measuring the time it takes to access global memory, for both reads and writes.

Below is the kernel function, test1():

__global__ void test1(int *direct_map)
{
    int index = 10;
    int index2;

    for (int j = 0; j < 1024; j++)
    {
        index2 = direct_map[index];   // load the next index from the current cell
        direct_map[index] = -1;       // mark the current cell
        index = index2;               // each iteration depends on the previous load
    }
}

direct_map is a 683*1024 linear matrix; each pixel stores an offset that is used to access another pixel.
index and index2 are not contiguous addresses.

This kernel takes about 600 microseconds.

But if I delete the line

direct_map[index] = -1;

it takes only 27 microseconds.

I think the value of direct_map[index] has already been read from global memory by

index2 = direct_map[index];

so it should then be resident in the L2 cache.
Therefore, when "direct_map[index] = -1;" executes, it should be fast.

I also tested random writes to global memory (test2()).

It takes about 120 microseconds.

__global__ void test2(int *direct_map)
{
    int index = 10;

    for (int j = 0; j < 1024; j++)
    {
        direct_map[index] = -1;       // scattered write to global memory
        index = j*683 + j/3 - 1;      // next index does not depend on any load
    }
}

So I do not understand why test1() takes more than 600 microseconds. Thank you.
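
For reference, a minimal host-side harness for collecting timings like these with CUDA events might look roughly as follows. This is a sketch, not the code from the question: the <<<1, 1>>> launch configuration and the offset-chain initialization are assumptions.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void test1(int *direct_map);    // kernel shown above

int main()
{
    const int n = 683 * 1024;

    // Fill direct_map with an arbitrary in-range offset chain (an assumption;
    // the question does not show how the offsets are generated).
    std::vector<int> h_map(n);
    for (int i = 0; i < n; i++)
        h_map[i] = (i + 683) % n;

    int *d_map;
    cudaMalloc(&d_map, n * sizeof(int));
    cudaMemcpy(d_map, h_map.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    test1<<<1, 1>>>(d_map);                // single-thread launch is an assumption
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("test1: %.1f us\n", ms * 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_map);
    return 0;
}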

Solution

When you delete the code line:

direct_map[index] = -1; 

your kernel isn't doing anything useful. The compiler can recognize this and eliminate most of the code associated with the kernel launch. From the compiler's perspective, that modification means the kernel no longer affects any global state, so the code is effectively useless.
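
A common way to keep this kind of micro-benchmark from being optimized away is to give the loads an observable side effect. A minimal sketch along those lines (the extra result parameter is hypothetical, not part of the original code):

__global__ void test1_keep(int *direct_map, int *result)
{
    int index = 10;

    for (int j = 0; j < 1024; j++)
    {
        index = direct_map[index];    // dependent load: next index comes from this load
    }

    *result = index;                  // observable write keeps the load chain alive
}

Because the final index is written out, the compiler cannot discard the 1024 dependent loads, so the measured time reflects the memory traffic rather than an empty kernel.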

You can verify this by dumping the assembly code that the compiler generates in each case, for example with cuobjdump -sass myexecutable.

Any time you make a small change to the code and see a large change in timing, you should suspect that the change has allowed the compiler to make different optimization decisions.

