Best way to copy global into shared memory


Question



Let's say I have a block of 32 threads that needs random access to a 1024-element array. I want to reduce the number of global memory calls by initially transferring the array from global to shared memory. I have two ideas for going about it:

A:

my_kernel()
{
    CopyFromGlobalToShared(1024 / 32 elements);
    UseSharedMemory();
}

or B:

my_kernel()
{
    if (first thread in block)
    {
        CopyFromGlobalToShared(all elements);
    }
    UseSharedMemory();
}

Which is better? Or is there another, better method?

Solution

A is better.

A GPU has much higher memory bandwidth than a CPU. However, the peak bandwidth can only be achieved when the threads running on the GPU follow a certain access pattern.

This pattern requires memory accesses to be coalesced: multiple threads should access sequential addresses in global memory, with special attention paid to alignment.

You can find more details about coalesced access to global memory in the CUDA docs:

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#coalesced-access-global-memory
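A minimal sketch of option A under the question's assumptions (32 threads per block, a 1024-element array; the names and `float` element type are illustrative, not from the post). Striding the copy loop by the block size keeps neighboring threads on neighboring addresses in every iteration, which is exactly the coalesced pattern described above:

```cuda
#define N 1024  // array size assumed from the question

__global__ void my_kernel(const float *global_data)
{
    __shared__ float shared_data[N];

    // Each of the 32 threads copies N / 32 = 32 elements.
    // In iteration k, thread t reads global_data[k * blockDim.x + t],
    // so consecutive threads touch consecutive addresses and the
    // loads coalesce into wide transactions.
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        shared_data[i] = global_data[i];

    __syncthreads();  // every element must be in place before random access

    // UseSharedMemory(): random accesses now hit shared_data, not global.
}
```

By contrast, option B serializes the entire copy through one thread, so each load is a separate narrow transaction and the other 31 threads sit idle; that is why A is better.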

