Influence of division operation in CUDA kernel on number of registers per thread


Problem description

I was writing a program that includes a CUDA kernel. I found that if you use #define OPERATOR * one thread will use 11 registers, but if you use #define OPERATOR / (the division operator) one thread will use 52 registers! What is wrong? I need to decrease the register count (I don't want to set maxregcount). How can I decrease the number of registers when using the division operator in a CUDA kernel?

#include <stdio.h>
#include <stdlib.h>
#define GRID_SIZE 1
#define BLOCK_SIZE 1
#define OPERATOR /

__global__ void kernel(double* array){
    for (int curEl = 0; curEl < BLOCK_SIZE; ++curEl){
        array[curEl] = array[curEl] OPERATOR 10;
    }
}

int main(void) {
    double *devPtr = NULL, *data = (double*)malloc(sizeof(double) * BLOCK_SIZE);
    cudaFuncAttributes cudaFuncAttr;
    // Query the compiled kernel's attributes, including per-thread register count
    cudaFuncGetAttributes(&cudaFuncAttr, kernel);
    for (int curElem = 0; curElem < BLOCK_SIZE; ++curElem){
        data[curElem] = curElem;
    }
    cudaMalloc(&devPtr, sizeof(double) * BLOCK_SIZE);
    cudaMemcpy(devPtr, data, sizeof(double) * BLOCK_SIZE, cudaMemcpyHostToDevice);
    kernel<<<1, BLOCK_SIZE>>>(devPtr);
    printf("1 thread needs %d regs\n", cudaFuncAttr.numRegs);
    return 0;
}


Recommended answer

The increase in register use when switching from double-precision multiplication to double-precision division in the kernel computation is due to the fact that double-precision multiplication is a built-in hardware instruction, whereas double-precision division is implemented as a sizable called software subroutine (that is, a function call of sorts). This is easily verified by inspecting the generated machine code (SASS) with cuobjdump --dump-sass.
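
For instance, one way to do that inspection on the code above might look like the following, assuming the source file is named kernel.cu and picking sm_35 purely as an example target; with OPERATOR defined as * the SASS should contain a single double-precision multiply, whereas with / a much longer instruction sequence appears in its place:

nvcc -arch=sm_35 -cubin -o kernel.cubin kernel.cu
cuobjdump --dump-sass kernel.cubin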

The reason that double-precision divisions (and in fact all divisions, including single-precision division and integer division) are emulated, either by inline code or by called subroutines, is that the GPU hardware has no direct support for division operations. This keeps the individual computational cores ("CUDA cores") as simple and as small as possible, which ultimately allows higher peak performance for a chip of a given size. It likely also improves the efficiency of the cores as measured by the GFLOPS/watt metric.

For release builds, the typical increase in register use caused by the introduction of double-precision division is around 26 registers. These additional registers are needed to store intermediate variables of the division computation, where each double-precision temporary requires two 32-bit registers.
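
If you want this number at compile time rather than through cudaFuncGetAttributes, a common alternative is to ask ptxas for verbose output; the exact wording of the report varies by toolkit version, but it includes the per-thread register count for each kernel:

nvcc -arch=sm_35 -Xptxas -v kernel.cu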

As Marco13 points out in a comment above, it may be possible to manually replace the division with a multiplication by the reciprocal. However, this causes slight numerical differences in most cases, which is why the CUDA compiler does not apply this transformation automatically.
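
A minimal sketch of what that manual replacement could look like for the kernel in the question (note that 1.0/10.0 is itself rounded, since 0.1 has no exact binary floating-point representation, so the result can differ in the last bits from a true division by 10):

__global__ void kernel(double* array){
    const double inv10 = 1.0 / 10.0;          // reciprocal of the divisor, rounded once
    for (int curEl = 0; curEl < BLOCK_SIZE; ++curEl){
        array[curEl] = array[curEl] * inv10;  // hardware multiply instead of emulated division
    }
}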

Generally speaking, register use can be controlled with compilation-unit granularity through the -maxrregcount nvcc compiler flag, or with per-function granularity via the __launch_bounds__ function attribute. However, forcing register use more than a few registers below the level determined by the compiler frequently leads to register spilling in the generated code, which usually has a negative impact on kernel performance.
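
As an illustration, the per-function form could be applied to the kernel in the question roughly as follows; the bound of 256 threads per block is only an example value, and the compilation-unit alternative would be a flag along the lines of nvcc --maxrregcount=32:

// Promise the compiler this kernel is never launched with more than 256 threads per block,
// an upper bound it can use when budgeting registers per thread; restricting registers
// too aggressively causes spills to local memory.
__global__ void __launch_bounds__(256) kernel(double* array){
    for (int curEl = 0; curEl < BLOCK_SIZE; ++curEl){
        array[curEl] = array[curEl] OPERATOR 10;
    }
}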

