Reducing register spilling?

Question

Hello, I've been writing my program in C++ AMP, and while my code is a lot faster than the first serial version, on an ATI Radeon 5870 (mobile) it is at best only 2x faster than using WARP.

The problem is to solve (L+6) equations (L = 1, 2, ..., 8) many times for different initial data. Most of the computational effort is spent repeatedly calling this function:

//performs an RKCK step for mode functions
template<int L> void RKCK(array_view<float_2, 3> Parameters, array_view<float_2, 2> N_array, array_view<float, 2> K_array, amp_extent_2 &Traj_Modes, float TEMP, int Stage){


//loop over all trajectories
parallel_for_each(Traj_Modes, [=](index<2> TrajMode)restrict(amp){
		
	float N = N_array[TrajMode].x;		
	float K = K_array[TrajMode];
	float H = Parameters[TrajMode[0]][TrajMode[1]][L].x;
			
	if(K > TEMP*expf(N)*H && N > -0.9f){
						
		float_2 Param_old[L+6];	

		float dN = N_array[TrajMode].y;									
		for(int l = 0; l < L+6; l++) Param_old[l] = Parameters[TrajMode[0]][TrajMode[1]][l];
			
		try_a_step<L>(Param_old, K, N, dN, TEMP, Stage);	//long calculation...
							
		for(int l = 0; l < L+6; l++) Parameters[TrajMode[0]][TrajMode[1]][l] = Param_old[l];		
		N_array[TrajMode].x = N;	
		N_array[TrajMode].y = dN;	
		}
});
}


The process is similar to the Nbody sample, where N = time, dN = deltatime, and Param_old[L+6] would contain the particle's position, velocity, etc.

I suspect one main problem is caused by the number of local variables I need to create. In addition to those above, the function try_a_step creates the following:

float_2 Param_new[L+6]

float_2 Param_error[L+6]

float_2 ak[6*(L+6)]

So in the simplest case there are going to be at least 100 floats for every thread, which means they are probably spilling to global memory?
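As a rough check of that estimate, the per-thread array footprint can be tallied directly from the sizes quoted above. This is only a sketch: the helper names are hypothetical, it counts just the four arrays named in the question (Param_old, Param_new, Param_error, ak), and it ignores scalars and any temporaries the compiler introduces, so the real register demand may be higher or lower.

```cpp
#include <cassert>
#include <cstddef>

// Number of float_2 elements held in per-thread arrays for a given L,
// using the array sizes quoted in the question:
//   Param_old[L+6] + Param_new[L+6] + Param_error[L+6] + ak[6*(L+6)]
// (hypothetical helper, not part of the original code)
inline std::size_t float2_per_thread(std::size_t L) {
    assert(L >= 1 && L <= 8);   // range of L stated in the question
    return 9 * (L + 6);         // 3 arrays of (L+6) plus one of 6*(L+6)
}

// Each float_2 holds two 32-bit floats.
inline std::size_t floats_per_thread(std::size_t L) {
    return 2 * float2_per_thread(L);
}

// Bytes of per-thread array storage, if nothing is optimised away.
inline std::size_t bytes_per_thread(std::size_t L) {
    return floats_per_thread(L) * sizeof(float);
}
```

Under these assumptions the smallest case, L = 1, needs 9 * 7 = 63 float_2 values (126 floats), and L = 8 needs 9 * 14 = 126 float_2 values (252 floats), which is consistent with the "at least 100 floats" figure.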

Will this have a greater impact than, say, rewriting Parameters as two separate array_view<float>?

Is there much else I can do apart from splitting it up into several parallel_for_each calls?



Recommended answer

Hi Antediluvian99, if the lifetimes of all of your 100 local variables overlap heavily, i.e., they need to coexist throughout the whole kernel, that would be a concern for performance. However, I would first try to optimize global memory access: in your code, none of the accesses to Parameters are coalesced. If you could restructure the layout of Parameters so that consecutive threads mostly access consecutive elements of Parameters, it might result in a decent speedup.
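To make the layout point concrete, here is a small host-side sketch of the index arithmetic only; it is not C++ AMP code. The Shape struct and both offset functions are hypothetical names. It assumes that the array_view is stored row-major with the last index varying fastest, and that neighbouring threads in the parallel_for_each differ in TrajMode[1].

```cpp
#include <cassert>
#include <cstddef>

// Extents of the 3-D Parameters array (hypothetical helper type).
struct Shape {
    std::size_t trajectories;  // extent of TrajMode[0]
    std::size_t modes;         // extent of TrajMode[1]
    std::size_t components;    // L + 6
};

// Layout used in the question: Parameters[traj][mode][l], l varies fastest.
// Neighbouring threads (mode, mode + 1) land `components` elements apart.
inline std::size_t offset_original(const Shape& s, std::size_t traj,
                                   std::size_t mode, std::size_t l) {
    assert(traj < s.trajectories && mode < s.modes && l < s.components);
    return (traj * s.modes + mode) * s.components + l;
}

// One possible restructuring: Parameters[l][traj][mode], mode varies fastest.
// For a fixed l, neighbouring threads now touch adjacent elements.
inline std::size_t offset_component_major(const Shape& s, std::size_t traj,
                                          std::size_t mode, std::size_t l) {
    assert(traj < s.trajectories && mode < s.modes && l < s.components);
    return (l * s.trajectories + traj) * s.modes + mode;
}
```

For example, with L = 1 (7 components), offset_original puts the elements read by threads with mode 0 and mode 1 seven float_2 values apart, while offset_component_major puts them next to each other. In the kernel this would correspond to declaring Parameters with extents (L+6, trajectories, modes) and indexing it as Parameters[l][TrajMode[0]][TrajMode[1]]; whether that gives the hoped-for speedup would need to be measured.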

