Optimizing CUDA kernel interpolation with nonuniform node points


ORIGINAL QUESTION

I have the following kernel performing an interpolation with nonuniform node points, and I would like to optimize it:

__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    int PP;
    double P;
    const double alfa=(2.-1./cc)*pi_double-0.01;
    double phi_cap_s;
    cufftDoubleComplex temp;

    double cc_points=cc*points[i];
    double r_cc_points=rint(cc*points[i]);

    temp = make_cuDoubleComplex(0.,0.);

    if(i<M) {   
        for(int m=0; m<(2*K+1); m++) {
            P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));

            if(P>0.)  phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
            if(P<0.)  phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));   
            if(P==0.) phi_cap_s = alfa/pi_double;        

            PP = modulo((r_cc_points + m -K ),(cc*N)); 
            temp.x = temp.x+phi_cap_s*Uj[PP].x; 
            temp.y = temp.y+phi_cap_s*Uj[PP].y; 
        } 

        result[i] = temp; 
    }
}

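For context, the kernel uses cc, K, pi_double and modulo() without showing their definitions; since 2*K+1 is later also used as an array size, cc and K must be compile-time constants. Below is a minimal sketch of how the missing pieces and a launch might look, using the values quoted later in the post (K = 6, cc = 2, N = 64); the block size of 128 threads and the helper names are assumptions, not part of the original code.

#include <cuComplex.h>   // make_cuDoubleComplex
#include <cufft.h>       // cufftDoubleComplex

#define K          6                   // interpolation half-width (value stated later in the post)
#define cc         2                   // oversampling factor (value stated later in the post)
#define pi_double  3.141592653589793   // how the post actually defines it is not shown

__device__ int modulo(int val, int modulus);   // reported at the end of the post

// Possible host-side launch for M output points (block size is an assumption):
void launch_interpolation(cufftDoubleComplex *d_Uj, double *d_points,
                          cufftDoubleComplex *d_result, int N, int M)
{
    dim3 block(128);
    dim3 grid((M + block.x - 1) / block.x);
    interpolation<<<grid, block>>>(d_Uj, d_points, d_result, N, M);
}
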
K and cc are constants, points contains the nodes and Uj the values to be interpolated. modulo is a function basically working as %, but properly extended to negative values. For a certain arrangement, the kernel call takes 2.3ms. I have verified that the most expensive parts are

            if(P>0.)  phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
            if(P<0.)  phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));   
            if(P==0.) phi_cap_s = alfa/pi_double;        

which takes about 40% of the total time, and

        PP = modulo((r_cc_points + m -K ),(cc*N)); 
        temp.x = temp.x+phi_cap_s*Uj[PP].x; 
        temp.y = temp.y+phi_cap_s*Uj[PP].y; 

which takes about 60%. With the Visual Profiler, I have verified that the performance of the former is not influenced by the presence of the if statements. Please note that I want double precision, so I'm avoiding the __exp() solution. I suspect that, for the latter, the "random" memory access Uj[PP] could be responsible for such a large share of the computation time. Any suggestions on tricks, or comments, to reduce the computation time? Thanks in advance.

VERSION FOLLOWING COMMENTS AND ANSWERS

Following the suggestions kindly provided in the answers and comments, I ended up with the code below:

__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    int PP;
    double P,tempd;
    const double alfa=(2.-1./cc)*pi_double-0.01;
    cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);

    double cc_points=cc*points[i];
    double r_cc_points=rint(cc_points);

    cufftDoubleComplex rtemp[(2*K+1)];
    double phi_cap_s[2*K+1];

    if(i<M) {   
     #pragma unroll //unroll the loop
     for(int m=0; m<(2*K+1); m++) {
         PP = modulo(((int)r_cc_points + m -K ),(cc*N)); 
            rtemp[m] = Uj[PP]; //2

         P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));
         if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd);  }
         else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
         else phi_cap_s[m] = alfa/pi_double;  
     }

     #pragma unroll //unroll the loop
     for(int m=0; m<(2*K+1); m++) {
         temp.x = temp.x+phi_cap_s[m]*rtemp[m].x; 
           temp.y = temp.y+phi_cap_s[m]*rtemp[m].y; 
     } 

     result[i] = temp; 
     }
 }

In particular:

1) I moved the global memory reads of Uj into the register array rtemp of size 2*K+1 (K is a constant equal to 6 in my case);
2) I moved the variable phi_cap_s into a 2*K+1-sized register array;
3) I used if ... else statements instead of the three previously used if's (the conditions P<0. and P>0. have the same occurrence probability);
4) I defined extra variables for the square root;
5) I used rsqrt instead of sqrt (as far as I know, sqrt() is calculated by CUDA as 1/rsqrt()).

I added each new feature one at a time, verifying the improvement against the original version, but I must say that none of them gave me any significant improvement.

The execution speed is limited by:

1) the calculation of the sin/sinh functions (about 40% of the time); is there any way to calculate them in double precision arithmetic, perhaps exploiting intrinsic math as a "starting guess"?
2) the fact that many threads end up accessing the same global memory locations Uj[PP] due to the mapping index PP; one possibility to avoid this would be using shared memory, but that would imply strong thread cooperation.

My question is: am I done? Namely, is there any means to improve the code? I profiled the code with the NVIDIA Visual Profiler and here are the results:

IPC = 1.939 (compute capability 2.1);
Global Memory Load Efficiency = 38.9%;
Global Memory Store Efficiency = 18.8%;
Warp Execution Efficiency = 97%;
Instruction Replay Overhead = 0.7%;

Finally, I would like to note that this discussion is linked to the discussion at CUDA: 1-dimensional cubic spline interpolation in CUDA.

VERSION USING SHARED MEMORY

I have made a feasibility study on using shared memory. I have considered N=64 so that the whole Uj fits in shared memory. Below is the code (basically my original version):

    __global__ void interpolation_shared(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
 {
         int i = threadIdx.x + blockDim.x * blockIdx.x;

     int PP;
     double P;
     const double alfa=(2.-1./cc)*pi_double-0.01;
     double phi_cap_s;
     cufftDoubleComplex temp;

     double cc_points=cc*points[i];
     double r_cc_points=rint(cc*points[i]);

     temp = make_cuDoubleComplex(0.,0.);

     __shared__ cufftDoubleComplex Uj_shared[128];

     if (threadIdx.x < cc*N) Uj_shared[threadIdx.x]=Uj[threadIdx.x];
     __syncthreads(); // make sure the whole shared copy of Uj is visible to every thread in the block

     if(i<M) {  
         for(int m=0; m<(2*K+1); m++) {
         P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));

         if(P>0.)  phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
         if(P<0.)  phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));  
         if(P==0.) phi_cap_s = alfa/pi_double;        

         PP = modulo((r_cc_points + m -K ),(cc*N)); 
         temp.x = temp.x+phi_cap_s*Uj_shared[PP].x; 
         temp.y = temp.y+phi_cap_s*Uj_shared[PP].y; 
      } 

      result[i] = temp; 
    }
 }

The result again does not improve significantly, although this might depend on the small size of the input array.

VERBOSE PTXAS OUTPUT

ptxas : info : Compiling entry function '_Z13interpolationP7double2PdS0_ii' for 'sm_20'
ptxas : info : Function properties for _Z13interpolationP7double2PdS0_ii
  352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 55 registers, 456 bytes cumulative stack size, 52 bytes cmem[0]

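For reference, this report is produced by asking nvcc to pass the verbose flag to ptxas at compile time, e.g. (the source file name here is just a placeholder):

nvcc -arch=sm_20 -Xptxas=-v interpolation.cu -o interpolation
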
VALUES OF P, FOR FIRST WARP AND m=0

 0.0124300933082964
 0.0127183892149176
 0.0135847002913749
 0.0161796378170038
 0.0155488126345702
 0.0138890822153499
 0.0121163187739057
 0.0119998374528905
 0.0131600831194518
 0.0109574866163769
 0.00962949548477354
 0.00695850974164358
 0.00446426651940612
 0.00423369284281705
 0.00632921297092537
 0.00655137618976198
 0.00810202954519923
 0.00597974034698723
 0.0076811348379735
 0.00604267951733561
 0.00402922460255439
 0.00111841719893846
 -0.00180949615796777
 -0.00246283218698551
 -0.00183256444286428
 -0.000462696661685413
 0.000725108980390132
 -0.00126793006072035
 0.00152263101649197
 0.0022499598348702
 0.00463681632275836
 0.00359856091027666

MODULO FUNCTION

__device__ int modulo(int val, int modulus)
{
   if(val > 0) return val%modulus;
   else
   {
       int P = (-val)%modulus;
       if(P > 0) return modulus -P;
       else return 0;
   }
}

MODULO FUNCTION OPTIMIZED ACCORDING TO ANSWER

__device__ int modulo(int val, int _mod)
{
    if(val > 0) return val&(_mod-1);
    else
    {
        int P = (-val)&(_mod-1);
        if(P > 0) return _mod -P;
        else return 0;
    }
}

SOLUTION

//your code above
cufftDoubleComplex rtemp[(2*K+1)]; //if it fits into available registers, assumes K is a constant

if(i<M) {   
#pragma unroll //unroll the loop
    for(int m=0; m<(2*K+1); m++) {

        PP = modulo((r_cc_points + m -K ),(cc*N)); 
        rtemp[m] = Uj[PP]; //2
    }
#pragma unroll
    for(int m=0; m<(2*K+1); m++) {
        P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
        // 1
        if(P>0.)  phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
        else if(P<0.)  phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));   
        else phi_cap_s = alfa/pi_double;  

        temp.x = temp.x+phi_cap_s*rtemp[m].x; //3
        temp.y = temp.y+phi_cap_s*rtemp[m].y; 
    }

    result[i] = temp; 
}

Explanation

  1. Added else if and else as these conditions are mutually exclusive; if you can, you should order the statements by probability of occurrence, e.g. if P<0. most of the time, you should evaluate that first.

  2. This will fetch the requested memory into multiple registers; what you did before may well have caused that thread to stall because the memory was not available in time for the calculation. And keep in mind that if one thread stalls in a warp, the whole warp is stalled. If there are not enough warps in the ready queue, execution will stall until some warp is ready.

  3. We have now moved the calculations further forward in time in order to compensate for the bad memory access; hopefully the calculations done in the meantime compensate for the bad access pattern.

The reason why this should work is the following:

A request for memory that resides in GMEM takes around 400-600 clock cycles. If a thread tries to perform operations on memory that is not yet available, it will stall. That means that if a memory request is not served from L1-L2, each warp has to wait that long or longer before it can continue.

What I suspect is that temp.x+phi_cap_s*Uj[PP].x is doing just that. By staging (step 2) each memory transfer into a register, and moving on to stage the next one, you hide the latency, because you can do other work while the memory is being transferred.

By the time you reach step 3, the memory is hopefully available, or you have to wait less time.

If rtemp does not fit into the registers needed to achieve 100% occupancy, you may have to do it in batches, as sketched below.

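A minimal sketch of what such batching could look like (this is not from the original answer; BATCH is a hypothetical chunk size, and K, cc, N, alfa, pi_double, cc_points, r_cc_points, temp, Uj and modulo() are the definitions already used in the post). It would replace the two unrolled loops inside the if(i<M) block:

const int BATCH = 4;   // assumed chunk size; the 2*K+1 = 13 taps are processed 4 at a time

for (int base = 0; base < 2*K+1; base += BATCH) {
    cufftDoubleComplex rtemp[BATCH];            // smaller staging array

    #pragma unroll
    for (int b = 0; b < BATCH; b++) {           // stage the loads of this chunk
        int m = base + b;
        if (m < 2*K+1) {
            int PP = modulo((int)r_cc_points + m - K, cc*N);
            rtemp[b] = Uj[PP];
        }
    }

    #pragma unroll
    for (int b = 0; b < BATCH; b++) {           // then compute and accumulate the chunk
        int m = base + b;
        if (m < 2*K+1) {
            double P = K*K - (cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K));
            double phi_cap_s;
            if      (P > 0.) phi_cap_s = (1./pi_double)*sinh(alfa*sqrt(P))/sqrt(P);
            else if (P < 0.) phi_cap_s = (1./pi_double)*sin(alfa*sqrt(-P))/sqrt(-P);
            else             phi_cap_s = alfa/pi_double;

            temp.x = temp.x + phi_cap_s*rtemp[b].x;
            temp.y = temp.y + phi_cap_s*rtemp[b].y;
        }
    }
}
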
You could also try to make phi_cap_s into an array and put it into the first loop like this:

#pragma unroll //unroll the loop
    for(int m=0; m<(2*K+1); m++) {
        //stage memory first
        PP = modulo((r_cc_points + m -K ),(cc*N)); 
        rtemp[m] = Uj[PP]; //2

        P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
        // 1
        if(P>0.)  phi_cap_s[m] = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));  
        else if(P<0.)  phi_cap_s[m] = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));   
        else phi_cap_s[m] = alfa/pi_double; 

    }
#pragma unroll
    for(int m=0; m<(2*K+1); m++) {
        temp.x = temp.x+phi_cap_s[m]*rtemp[m].x; //3
        temp.y = temp.y+phi_cap_s[m]*rtemp[m].y; 
    }

Edit

Expression

P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));

Can be broken down into:

const double cc_diff = cc_points-r_cc_points;
double exp = cc_diff - (double)(m-K);
exp *= exp;
P = (K*K-exp);

Which may reduce the number of instructions used.

Edit 2

__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    int PP;
    double P,tempd;


    cufftDoubleComplex rtemp[(2*K+1)];
    double phi_cap_s[2*K+1];

    if(i<M) {
         const double cc_points=cc*points[i];
         cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);

         const double alfa=(2.-1./cc)*pi_double-0.01;


         const double r_cc_points=rint(cc_points);
         const double cc_diff = cc_points-r_cc_points;

     #pragma unroll //unroll the loop
         for(int m=0; m<(2*K+1); m++) {
             PP = m-K; //reuse PP
             double exp = cc_diff - (double)(PP); //stage exp to be used later, will explain

             PP = modulo(((int)r_cc_points + PP ),(cc*N)); 
             rtemp[m] = Uj[PP]; //2


             exp *= exp;
             P = (K*K-exp);

             if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd);  }
             else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
             else phi_cap_s[m] = alfa/pi_double;  
         }

     #pragma unroll //unroll the loop
         for(int m=0; m<(2*K+1); m++) {
             temp.x = temp.x+phi_cap_s[m]*rtemp[m].x; 
             temp.y = temp.y+phi_cap_s[m]*rtemp[m].y; 
         } 

     result[i] = temp; 
     }
 }

What I have done is move all the calculations inside the if statement, to free up some resources both in terms of calculation and of memory fetches; I do not know how much divergence you have on the first if statement if(i<M). As m-K appeared twice in the code, I first put it in PP so it can be reused when you calculate exp and PP.

What else you can do is try to order your instructions so that, after you set a variable, as many instructions as possible are executed before the next usage of said variable, as it takes ~20 ticks for the value to land in the registers. Hence, I put the constant cc_diff at the top; however, as this is only a single instruction, it may not show any benefit.

Modulo function

__device__ int modulo(int val, int _mod) {
    // as the modulus is always a power of 2, & can replace %
    if(val < 0) {
        int p = ((-val)&(_mod-1));
        return (p > 0) ? _mod - p : 0;
    } else {
        return (val&(_mod-1));
    }
}

As _mod is always an integer power of 2 (cc = 2, N = 64, cc*N = 128), we can use this function instead of the mod operator. This should be "much" faster. Check it though, to make sure I have the arithmetic right; a quick host-side check is sketched below. The trick is from Optimizing Cuda - Part II, Nvidia, page 14.

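The check can be done on the host (a minimal sketch, not from the original answer): it compares the %-based modulo from the question with the bitmask variant adopted in the question ("MODULO FUNCTION OPTIMIZED ACCORDING TO ANSWER") over a range of positive and negative values, for the power-of-2 modulus used here.

#include <cstdio>

// %-based version, as posted in the question
int modulo_ref(int val, int modulus)
{
    if (val > 0) return val % modulus;
    int P = (-val) % modulus;
    return (P > 0) ? modulus - P : 0;
}

// bitmask version for a power-of-2 modulus, as adopted in the question
int modulo_pow2(int val, int _mod)
{
    if (val > 0) return val & (_mod - 1);
    int P = (-val) & (_mod - 1);
    return (P > 0) ? _mod - P : 0;
}

int main()
{
    const int mod = 128;                       // cc*N in the post
    int mismatches = 0;
    for (int v = -4 * mod; v <= 4 * mod; ++v)
        if (modulo_ref(v, mod) != modulo_pow2(v, mod)) {
            printf("mismatch at %d: %d vs %d\n", v, modulo_ref(v, mod), modulo_pow2(v, mod));
            ++mismatches;
        }
    printf("%d mismatches\n", mismatches);     // expected: 0 for a power-of-2 modulus
    return 0;
}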