CUDA performance: branching and shared memory

Question

I wish to ask two questions on performance. I have been unable to create simple code that illustrates them.

Question 1: How expensive is non-divergent branching? In my code it seems to cost even more than the equivalent of 4 non-FMA FLOPs. Note that I am speaking of the BRA PTX instruction, where the predicate has already been calculated.

Question 2: I have been reading a lot about the performance of shared memory, and some articles, such as this Dr. Dobb's article (http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731), even state that it can be as fast as registers (provided it is accessed properly). In my code, all threads within the warps within the block access the same shared variable. I believe shared memory is accessed in broadcast mode in this case, isn't it? Should it reach register performance this way? Is there anything special to consider to make that work?

EDIT: I have been able to construct some simple code that gives more insight into my query.

Here it is:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <float.h>
#include "cuComplex.h"
#include "time.h"
#include "cuda_runtime.h"
#include <iostream>
using namespace std;

__global__ void test()
{
    __shared__ int t[1024];
    int v = t[0];
    bool b = (v == -1);
    bool c = (v == -2);
    int myValue = 0;
    for (int i = 0; i < 800; i++)
    {
#if 1
        v = i;              // arithmetic-only variant
#else
        v = t[i];           // shared-memory-read variant
#endif

#if 0
        if (b) {            // never-taken, non-divergent branch
            printf("abs");
        }
#endif
        if (c)
        {
            printf("IT HAPPENED");
            v = 8;
        }
        myValue += v;
    }
    if (myValue == 1000)
        printf("IT HAPPENED");
}

int main(int argc, char *argv[])
{
    cudaEvent_t event_start, event_stop;
    float timestamp;

    // Initialise
    cudaDeviceReset();
    cudaSetDevice(0);

    dim3 threadsPerBlock;
    dim3 blocks;
    threadsPerBlock.x = 32;
    threadsPerBlock.y = 32;
    threadsPerBlock.z = 1;
    blocks.x = 1;
    blocks.y = 1000;
    blocks.z = 1;

    cudaEventCreate(&event_start);
    cudaEventCreate(&event_stop);
    cudaEventRecord(event_start, 0);
    test<<<blocks, threadsPerBlock, 0>>>();
    cudaEventRecord(event_stop, 0);
    cudaEventSynchronize(event_stop);
    cudaEventElapsedTime(&timestamp, event_start, event_stop);
    printf("Calculated in %f", timestamp);
    return 0;
}

I am running this code on a GTX680.

The results are as follows:

If run as-is, it takes 5.44 ms.

If I change the first #if conditional to 0 (which enables reading from shared memory), it takes 6.02 ms. Not much more, but still not good enough for me.

If I enable the second #if conditional (which inserts a branch that will never evaluate to true), it runs in 9.647040 ms. The performance reduction is very big. What is the cause, and what can be done about it?

I have also changed the code slightly to make further checks on shared memory.

Instead of

__shared__ int t[1024]

I used

__shared__ int2 t[1024] 

and wherever I access t[], I now access only t[].x. This caused a further drop in performance, to 10 ms (another 400 microseconds). Why should this happen?
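
For reference, the modified kernel body amounts to something like this (a sketch reconstructed from the description above, not the exact code):

__global__ void testInt2()
{
    __shared__ int2 t[1024];    // 8 KB of shared memory instead of 4 KB
    int v = t[0].x;
    bool c = (v == -2);
    int myValue = 0;
    for (int i = 0; i < 800; i++)
    {
        v = t[i].x;             // consecutive elements are now 8 bytes apart
        if (c)
        {
            printf("IT HAPPENED");
            v = 8;
        }
        myValue += v;
    }
    if (myValue == 1000)
        printf("IT HAPPENED");
}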

Regards,
Daniel

Answer

Have you determined whether your kernel is compute bound or memory bound? Your first question would be most relevant if it is compute bound, and the second most relevant if it is memory bound. You may get results that are confusing or hard to reproduce if you assume one while it is actually the other.

(1) I don't think the cost of a branch has been published. You may have to determine it experimentally for your architecture. The CUDA Programming Guide does say that there is no "branch prediction and no speculative execution."
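
One rough way to do that (a sketch of an assumed approach, not a published method) is to time two builds of a loop that differ only by a uniform, never-taken branch, using the same cudaEvent timing as in the code above. The heavy printf body is there because a trivial branch body would likely be compiled into predicated instructions rather than an actual BRA:

__global__ void branchCost(int *out, int flag)   // launch with flag = 0
{
    int v = 0;
    for (int i = 0; i < 10000; i++)
    {
        if (flag)                // uniform across the warp; never taken
        {
            printf("taken");     // heavy body, so the compiler emits a
            v = 0;               // real branch instead of predicating
        }
        v += i;
    }
    out[threadIdx.x] = v;        // stored so the loop is not optimized away
}

Comparing the elapsed time against a build with the if block deleted gives an estimate of the per-iteration branch cost; it is worth inspecting the generated PTX to confirm that a BRA is actually emitted.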

(2) You're right that when all the threads in a warp access a single 32-bit value in shared memory, the value is broadcast. My guess is that accessing a single value from all threads costs the same as accessing any combination of values, as long as you don't incur bank conflicts, so you end up with the latency of a single fetch from shared memory. I don't think the number of cycles of latency has been published, but it is short enough that it is normally easily hidden.
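
To illustrate the distinction, here is a hypothetical kernel (not taken from the question; the conflict behaviour assumes the default 32-bank, 4-byte-bank configuration):

__global__ void sharedPatterns(int *out)   // launch with, e.g., 256 threads per block
{
    __shared__ int s[1024];
    int tid = threadIdx.x;
    s[tid] = tid;
    __syncthreads();

    int a = s[0];                  // broadcast: the whole warp reads one word,
                                   // served by a single transaction
    int b = s[tid];                // stride-1: one word per bank, also a
                                   // single transaction, no conflicts
    int c = s[(2 * tid) % 1024];   // stride-2: two threads per bank, a 2-way
                                   // conflict serialized into two transactions
    out[tid] = a + b + c;
}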
