How to let GCC compiler turn variable-division into mul (if faster)


Problem description



int a, b;
scanf("%d %d", &a, &b);
printf("%d\n", (unsigned int)a/(unsigned char)b);

When compiling, I got ...

    ::00401C1E::  C70424 24304000          MOV DWORD PTR [ESP],403024  %d %d
    ::00401C25::  E8 36FFFFFF              CALL 00401B60               scanf
    ::00401C2A::  0FB64C24 1C              MOVZX ECX,BYTE PTR [ESP+1C]
    ::00401C2F::  8B4424 18                MOV EAX,[ESP+18]                        
    ::00401C33::  31D2                     XOR EDX,EDX                             
    ::00401C35::  F7F1                     DIV ECX                                 
    ::00401C37::  894424 04                MOV [ESP+4],EAX                         
    ::00401C3B::  C70424 2A304000          MOV DWORD PTR [ESP],40302A  %d\x0A
    ::00401C42::  E8 21FFFFFF              CALL 00401B68               printf

Would it be faster if the DIV were turned into a MUL, using an array to store the multiplier values? If so, how can I get the compiler to do this optimization?

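#include <stdio.h>
#include <time.h>

typedef unsigned int uint;

int main() {
    uint a, s=0, i;
    clock_t t;
    scanf("%u", &a);
    // diviuint aa = a;   // my own type: caches 1/a and multiplies instead of dividing
    t = clock();
    for (i=0; i<1000000000; i++)
        s += i/a;         // the diviuint version uses s += i/aa here
    printf("Result:%10u\n", s);
    printf("Time:%12ld\n", (long)(clock()-t));
    return 0;
}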

where diviuint(a) precomputes and stores 1/a, and multiplies by it instead of dividing. Using s+=i/aa is about 2 times as fast as s+=i/a.

Solution

You are correct that finding the multiplicative inverse may be worth it if integer division inside a loop is unavoidable. gcc and clang won't do this for you with run-time constants, though; only compile-time constants. It's too expensive (in code-size) for the compiler to do without being sure it's needed, and the perf gains aren't as big with non compile-time constants. (I'm not confident a speedup will always be possible, depending on how good integer division is on the target microarchitecture.)
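For reference, here is a sketch of what the compile-time-constant transformation looks like when written out by hand in C. The helper names (mulhi_u32, div5, div7) are made up for illustration. The two magic constants are the well-known ones for unsigned 32-bit division by 5 and by 7, but treat the code as an illustration rather than a dump of what any particular gcc version emits.

```c
#include <stdint.h>

/* High 32 bits of the 64-bit product x * m (hypothetical helper name). */
static uint32_t mulhi_u32(uint32_t x, uint32_t m) {
    return (uint32_t)(((uint64_t)x * m) >> 32);
}

/* x / 5: the magic number 0xCCCCCCCD is ceil(2^34 / 5), so the quotient is
   the high half of the product shifted right by 2 more bits. */
static uint32_t div5(uint32_t x) {
    return mulhi_u32(x, 0xCCCCCCCDu) >> 2;
}

/* x / 7: the exact magic number ceil(2^35 / 7) needs 33 bits, so only its
   low 32 bits (0x24924925) are multiplied, and the missing top bit is
   handled with a subtract / shift / add fix-up before the final shift. */
static uint32_t div7(uint32_t x) {
    uint32_t hi = mulhi_u32(x, 0x24924925u);
    return (((x - hi) >> 1) + hi) >> 2;
}
```

Both functions should agree with x / 5 and x / 7 for every 32-bit x. Note that the two divisors already need different instruction sequences, which is cheap when the divisor is known at compile time.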


Using a multiplicative inverse

If you can't transform things to pull the divide out of the loop, and it runs many iterations, and a significant increase in code-size is worth the performance gain (e.g. you aren't bottlenecked on cache misses that hide the div latency), then you might get a speedup from doing for run-time constants what the compiler does for compile-time constants.

Note that different constants need different shifts of the high half of the full-multiply, and some constants need more different shifts than others. (Another way of saying that some of the shift-counts are zero for some constants). So non-compile-time-constant divide-by-multiplying code needs all the shifts, and the shift counts have to be variable-count. (On x86, this is more expensive than immediate-count shifts).
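As a rough sketch of the run-time case, here is one formulation that takes a different trade-off from the one described above: it keeps a 64-bit magic number per divisor and takes the high half of a 64x64-bit product, so the shift count is fixed and no per-divisor fix-up is needed. The names udiv32_t, udiv32_prepare and udiv32_apply are invented for this example, and this is not libdivide's implementation. It assumes a divisor d >= 2 and a compiler with unsigned __int128 (gcc or clang on a 64-bit target).

```c
#include <stdint.h>

/* Per-divisor state, computed once outside the hot loop (hypothetical type). */
typedef struct {
    uint64_t magic;   /* ceil(2^64 / d) */
} udiv32_t;

/* Precompute the magic number. Requires d >= 2: for d == 1 the value 2^64
   does not fit in 64 bits, so that case would need separate handling. */
static udiv32_t udiv32_prepare(uint32_t d) {
    udiv32_t inv = { UINT64_MAX / d + 1 };
    return inv;
}

/* n / d using one widening multiply: the quotient is the high 64 bits of
   magic * n. This is exact for every 32-bit n because the rounding error
   in magic is too small to matter at this width. */
static uint32_t udiv32_apply(uint32_t n, udiv32_t inv) {
    return (uint32_t)(((unsigned __int128)inv.magic * n) >> 64);
}
```

In the loop from the question, you would call udiv32_prepare(a) once before the loop and replace i/a with udiv32_apply(i, inv). Whether that beats the hardware div depends on the CPU, so it is worth benchmarking.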

libdivide has an implementation of the necessary math. You can use it to do SIMD-vectorized division, or for scalar, I think. This will definitely provide a big speedup over unpacking to scalar and doing integer division there. I haven't used it myself.

(Intel SSE/AVX doesn't do integer-division in hardware, but provides a variety of multiplies, and fairly efficient variable-count shift instructions. For 16bit elements, there's an instruction that produces only the high half of the multiply. For 32bit elements, there's a widening multiply, so you'd need a shuffle with that.)

Anyway, you could use libdivide to vectorize that add loop, with a horizontal sum at the end.


Other ways to get the div out of the loop

for (i=0; i<1000000000; i++)
    s += i/a;

In your example, you might get better results from using a 128-bit accumulator for s (unsigned __int128 in gcc) and dividing by a once, outside the loop. A 64bit add/adc pair is pretty cheap. (It wouldn't give identical results, though, because the loop truncates each quotient separately, and the sum of the truncated quotients is not the same as the truncated quotient of the sum.)

I think you can account for that by looping with i += a; tmp++, and doing s += tmp*a, to combine all the adds from iterations where i/a is the same. So s += 1 * a accounts for all the iterations from i = [a .. a*2-1]. Obviously that was just a trivial example, and looping more efficiently is usually not actually possible. It's off-topic for this question, but worth saying anyway: look for big optimizations by re-structuring code or taking advantage of some math before trying to speed up doing the exact same thing.

Speaking of math, you can use the sum(0..n) = n * (n+1) / 2 formula here, because we can factor a out of a*0 + a*1 + a*2 ... a*(q-1), where q = 1000000000 / a is the number of complete runs of a iterations. The leftover 1000000000 % a iterations at the end each add q. So a closed-form constant-time calculation gives the same answer as the loop for any nonzero a. The intermediate math is done in 64 bits and then truncated, to match the wraparound of the loop's 32-bit s:

uint64_t q = 1000000000 / a;     // complete runs of a iterations; run k adds k each time
uint64_t r = 1000000000 % a;     // leftover iterations, each adding q
uint32_t s = (uint32_t)(a * q*(q-1)/2 + r*q);
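Closed forms like this are easy to get wrong by one, so it may be worth checking against the plain loop. The sketch below counts it this way: with q = n_iters / a and r = n_iters % a, each quotient k in 0..q-1 occurs a times, and the last r iterations each contribute q. The function names and the n_iters parameter are invented for the example, and a is assumed to be nonzero.

```c
#include <stdint.h>

/* Reference: the loop from the question, with the bound as a parameter. */
static uint32_t sum_quotients_loop(uint32_t n_iters, uint32_t a) {
    uint32_t s = 0;
    for (uint32_t i = 0; i < n_iters; i++)
        s += i / a;                 /* wraps modulo 2^32, like the original */
    return s;
}

/* Constant-time equivalent: a * (0 + 1 + ... + (q-1)) + r * q. */
static uint32_t sum_quotients_closed(uint32_t n_iters, uint32_t a) {
    uint64_t q = n_iters / a;       /* number of complete runs of a iterations */
    uint64_t r = n_iters % a;       /* leftover iterations, each adding q */
    /* q * (q - 1) is always even, so halving first is exact and keeps the
       64-bit intermediate from overflowing. */
    uint64_t total = (uint64_t)a * (q * (q - 1) / 2) + r * q;
    return (uint32_t)total;         /* truncate to match the 32-bit loop sum */
}
```

Comparing the two functions for a handful of small n_iters and a values (including a larger than n_iters, and a == 1 with enough iterations to wrap) is a cheap way to gain confidence before trusting the formula at n_iters = 1000000000.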


If you just needed i/a in a loop, it might be worth it to do something like:

// another optimization for an unlikely case
// If you don't need the remainder in the loop, it could save an insn or two
// (e.g. on x86) to count down from a to 0 instead of up from 0 to a.
// But then you need a clever variable name other than remainder.
for (uint32_t i=0, remainder=0, i_over_a=0 ; i < n ; i++) {
    // use i_over_a

    ++remainder;
    if (remainder == a) {
        remainder = 0;
        ++i_over_a;
    }
}

Again, this is an unlikely case: it only works if you're dividing the loop counter by a value that stays constant for the whole loop. However, it should work well. Either a is large so branch mispredicts will be infrequent, or a is (hopefully) small enough for a good branch predictor to recognize the repeating pattern of a-1 branches one way, then 1 branch the other way. The worst-case a value might be 33 or 65 or something, depending on microarchitecture.

Branchless asm is probably possible but not worth it, e.g. handle ++i_over_a with an add-with-carry and a conditional move for zeroing. (e.g. x86 pseudo-code cmp a-1, remainder / cmovc remainder, 0 / adc i_over_a, 0. The b (below) condition is just CF==1, same as the c (carry) condition.) The branchless asm would be simplified by decrementing from a to 0: you wouldn't need a zeroed reg for cmov, and could have a in a reg instead of a-1.
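A sketch of the count-down variant, written as a self-checking function: it tracks i/a incrementally and compares against a real division on every iteration. The division is only there for the check and would be dropped in real code. The function and variable names are invented for this example, and a is assumed to be nonzero.

```c
#include <stdint.h>

/* Returns 1 if the incrementally tracked quotient equals i / a for every
   i in [0, n), and 0 at the first mismatch. */
static int incremental_quotient_matches(uint32_t n, uint32_t a) {
    uint32_t countdown = a;     /* iterations left before the quotient steps */
    uint32_t i_over_a = 0;      /* running value of i / a */

    for (uint32_t i = 0; i < n; i++) {
        if (i_over_a != i / a)  /* check only; real code would just use i_over_a */
            return 0;

        /* Counting down lets the compare be against zero, which the
           decrement itself can usually provide as a flag result. */
        if (--countdown == 0) {
            countdown = a;
            ++i_over_a;
        }
    }
    return 1;
}
```

Calling it with a few divisors around the sizes mentioned above (for example 1, 3, 33 and 65) exercises both the "steps every iteration" and the "steps rarely" patterns.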
