Cuda faster passing values?


Problem description



I'm trying to make a Cuda application a little more dynamic by passing a value to the kernel which comes from command line arguments.

The application calls multiple kernels, and maximizes block and grid size as well. When I try running the application, these are the results I get:

  • Hard coded value: .96 seconds
  • Passing a value at kernel initialization: 3.48 seconds
  • Declaring a __device__ int, and setting it to the value: 3.48 seconds

Once the value is entered at execution time, it will remain constant for the remainder of the program.

The two 3.48 second times come from access to the variable itself. If I were to replace the variable with a hard-coded integer, the runtime gets cut drastically. This value is accessed very frequently, and I was wondering if there's a way to keep the speed similar to the hard coded value, but reduce the cost of accessing the variable. Is it possible to speed this up by using a variable?

Is 3.6x slower important? Sort-of. This is only a small set of something much larger.

Any help would be greatly appreciated.

*running 2.0 hardware.

Edit: Here's an example of the difference I'm experiencing:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <ctime>

using namespace std;

clock_t start;

__device__ int x;

__global__ void setNum(int i)
{
        x = i;
        return;
}

__device__ void d_swap(int * a, int * b)
{
        int temp = *a;
        *a = *b;
        *b = temp;
}

__device__ void other(int n, int * vec)
{
        int i;
        for(i = 0; i < n; ++i) vec[i] = i;
        for (int j = 0; j < 5; j++)
                for(i = 1; i < n-1; ++i)
                        d_swap(&vec[i], &vec[i-1]);
}

__global__ void Pressure(int i)
{
        int a[12];
        other(x, a);
        //other(12,a);
}

int main(int argc, char * argv[])
{
        if (argc != 2)
        {
                fprintf(stderr,"Invalid number of arguments.\n");
                exit(1);
        }
        int num = atoi(argv[1]);
        cudaSetDevice(1);

        cudaMemset(&x, num, sizeof(int));
        setNum<<< 1 , 1>>>( num );

        cudaError_t cuda_status = cudaDeviceSynchronize();
        if (cuda_status != cudaSuccess) {
                printf("No dice\n");
                exit(1);
        }
        int results = 0;
        cudaMemcpyFromSymbol(&results, x, sizeof(int));
        printf("Value of x: %i\n", results);

        start = clock();
        for (int i = 0; i < 8; i++)
                Pressure<<<65535, 1024>>>(i);
        cuda_status = cudaDeviceSynchronize();
        printf("Result: %f\n", (float)(clock()-start)/CLOCKS_PER_SEC);
        return 0;
}

Compiled with: nvcc -m64 -gencode arch=compute_20,code=sm_20 -o test test.cu

Run with: ./test 12 (12 sets the x variable)

Note the commented out block of code:

Running other(x, a);, I get 1.370000

Running other(12,a);, I get 0.020000

Solution

You haven't shown your code, so my comments are necessarily general in nature.

Often, when we see large differences in execution time based on relatively small changes to a code, it's due to changes in what the compiler can optimize. So there are several things to consider:

  1. The idea that replacement of a constant with a variable would result in this level of execution time change seems unlikely to me, because the compiler has plenty of ways to optimize access to frequently used data, even if it is dynamic/variable in nature. You might want to compare the PTX code that is generated in each case, to get an idea why there is such a difference, and to test your conclusion that the actual difference is due to repeated access. Normally the compiler will detect this (especially for values that are unmodified) and optimize the access into registers.
  2. If the number of command line options is relatively small, you could consider using a templated kernel, with different instantiations for each of the options/choices. This should result in a kernel that effectively has the choice hard-coded, and so its performance should be roughly equivalent to your faster case.
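
To illustrate the templated-kernel suggestion in point 2, here is a minimal sketch. The names PressureT and launchPressure, the supported values 8, 12 and 16, and the out buffer are assumptions made for illustration; none of them come from the question. The kernel also stores one result per thread so that its work stays observable.

```cuda
#include <cuda_runtime.h>

__device__ void d_swap(int *a, int *b)
{
    int temp = *a;
    *a = *b;
    *b = temp;
}

// N is a template parameter, so it is a compile-time constant inside the
// kernel and the compiler can treat it like a hard-coded value.
template <int N>
__global__ void PressureT(int *out)
{
    int a[N];
    for (int i = 0; i < N; ++i) a[i] = i;
    for (int j = 0; j < 5; ++j)
        for (int i = 1; i < N - 1; ++i)
            d_swap(&a[i], &a[i - 1]);
    // Write one value per thread so the computation has a visible effect.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a[0];
}

// Hypothetical host-side dispatcher: maps the runtime value n onto one of
// the instantiations compiled above. Returns false if n is not supported.
bool launchPressure(int n, int *d_out, int blocks, int threads)
{
    switch (n) {
    case 8:  PressureT<8><<<blocks, threads>>>(d_out);  return true;
    case 12: PressureT<12><<<blocks, threads>>>(d_out); return true;
    case 16: PressureT<16><<<blocks, threads>>>(d_out); return true;
    default: return false;  // no instantiation was compiled for this value
    }
}
```

A caller would allocate d_out with room for blocks * threads ints and then call launchPressure(num, d_out, blocks, threads) with the value parsed from the command line. The cost of this approach is one compiled kernel per supported value, which is why it only suits a small set of options.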

EDIT: Since you've now posted some code, I'll make a few additional comments.

  1. You have errors in your program that you are not properly catching. Please do proper CUDA error checking to avoid any confusion due to this. One error is in your usage of cudaMemset with a __device__ symbol.

  2. The discrepancy in the code you have posted is due to compiler optimization. I'm not going to go into a great deal of analysis on this because the code you have posted appears to be basically nonsense code. But there are two ways I can support this assertion.

    • Compile your code with the -G switch, which turns off all compiler optimization of the device code. The timing between the two cases becomes the same (for me, about 7 seconds). With no optimizations, the two versions have essentially the same execution time.

    • Look at the PTX output. Compile your code, in both cases, with the -ptx switch. In the "fast" case, at the end of the PTX file I see this global function definition:

      .visible .entry _Z8Pressurei(
          .param .u32 _Z8Pressurei_param_0
      )
      {
      
      
      
      .loc 2 52 2
      ret;
      }
      

This code is doing nothing. It's simply an empty function with a return statement. The compiler completely optimized the function behavior away.

In the "slow" case, the Pressure function has ~50 lines of actual code in it, and the overall PTX file is much larger.
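
On the error-checking point in item 1, the following is a minimal sketch of one way to check every CUDA call and to set a __device__ variable from the host. The helper names checkCuda, readX and roundTrip are assumptions made for illustration. The sketch uses cudaMemcpyToSymbol because cudaMemset expects a device pointer and fills memory byte by byte, so it is not a suitable way to store an int value in a symbol.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int x;

// Hypothetical helper: reports a failed CUDA call and returns false.
bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Trivial kernel that copies the device variable x to global memory.
__global__ void readX(int *out)
{
    *out = x;
}

// Sets x from the host, reads it back through a kernel, and checks each
// step. Returns true only if every CUDA call succeeded.
bool roundTrip(int value, int *result)
{
    int *d_out = NULL;
    if (!checkCuda(cudaMalloc((void **)&d_out, sizeof(int)), "cudaMalloc"))
        return false;

    // Copy the host value into the __device__ symbol.
    bool ok = checkCuda(cudaMemcpyToSymbol(x, &value, sizeof(int)),
                        "cudaMemcpyToSymbol");
    if (ok) {
        readX<<<1, 1>>>(d_out);
        // A launch reports configuration errors via cudaGetLastError and
        // execution errors at the next synchronizing call.
        ok = checkCuda(cudaGetLastError(), "kernel launch")
          && checkCuda(cudaDeviceSynchronize(), "kernel execution")
          && checkCuda(cudaMemcpy(result, d_out, sizeof(int),
                                  cudaMemcpyDeviceToHost), "cudaMemcpy");
    }
    cudaFree(d_out);
    return ok;
}
```

Calling roundTrip(12, &results) from main and printing results would confirm that x holds the intended value before any timing is done. The same checkCuda pattern can wrap cudaSetDevice and the timed kernel launches.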
