优化的机会,以下位操作? [英] Optimization chance for following Bit-Operations?

查看:132
本文介绍了优化的机会,以下位操作?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

你觉得有空间的功能haswon优化(见下文)?

我认为,改变从 __的Int64 参数类型为无符号__int64 提出的功能更快,因此我thougt也许还有优化的机会。

的详细信息: 我写一个连接四游戏。最近我用事件探查器的很困的,并承认该功能的 haswon 的使用太多的CPU时间。该函数使用连接四板的棋盘,再presentation一个球员。该函数本身,我发现在 fourstones 基准的来源。在棋盘重新presentation是以下内容:

 。 。 。 。 。 。 。最佳
5 12 19 26 33 40 47
4 11 18 25 32 39 46
3 10 17 24 31 38 45
2 9 16 23 30 37 44
1 8 15 22 29 36 43
0 7 14 21 28 35 42底
 

功能:

  //返回newboard是否包括一个双赢
布尔haswon(无符号__int64 newboard)
{
    无符号__int64 Y = newboard和放大器; (newboard>&→6);
    如果(Y及(Y'GT;> 2 * 6))//检查\对角线
        返回true;
    Y = newboard和放大器; (newboard>大于7);
    如果(Y及(Y'GT;> 2 * 7))//检查水平 - 
        返回true;
    Y = newboard和放大器; (newboard>→8);
    如果(Y及(Y'GT;> 2 * 8))//检查/对角线
        返回true;
    Y = newboard和放大器; (newboard>大于1);
    如果(Y及(Y'GT;> 2))//检查垂直|
        返回true;
    返回false;
}
 

谢谢!

编辑: CPU是86,32位架构,我使用的编译器从Visual Studio 2008例preSS版。优化标志是/ O2 /爱/ G​​L。

我试过功能haswon2这本杰克逊建议。从微软编译器的组件,与缺省的优化参数发布版本(/ O2 /爱/ G​​L),显示几乎没有运行时的差异。它看起来像VC的编译器相比,GCC不能利用它不能评估每个条件严格的顺序。

结果: haswon原文:

这是本·杰克逊haswon2:

EDIT2: haswon的装配:

  00401A10 MOV EAX,DWORD PTR [ESP + 4]
00401A14 MOV ECX,DWORD PTR [ESP + 8]
00401A18推EBX
00401A19推ESI
00401A1A推EDI
00401A1B MOV EDX,EAX
00401A1D MOV EDI,ECX
00401A1F SHRD EDX,EDI,6
00401A23 MOV ESI,EDX
00401A25 SHR EDI,6
00401A28和ESI,EAX
00401A2A和EDI,ECX
00401A2C MOV EDX,ESI
00401A2E MOV EBX,EDI
00401A30 SHRD EDX,EBX,0CH
00401A34 SHR EBX,0CH
00401A37和EDX,ESI
00401A39和EBX,EDI
00401A3B或EDX,EBX
00401A3D济`匿名命名空间:: haswon + 35H(401A45h)
00401A3F MOV人,1
00401A41流行EDI
00401A42流行ESI
00401A43流行EBX
00401A44 RET
00401A45 MOV EDX,EAX
00401A47 MOV EDI,ECX
00401A49 SHRD EDX,EDI,7
00401A4D MOV ESI,EDX
00401A4F SHR EDI,7
00401A52和ESI,EAX
00401A54和EDI,ECX
00401A56 MOV EDX,ESI
00401A58 MOV EBX,EDI
00401A5A SHRD EDX,EBX,0EH
00401A5E SHR EBX,0EH
00401A61和EDX,ESI
00401A63和EBX,EDI
00401A65或EDX,EBX
00401A67 JNE`匿名命名空间:: haswon + 2Fh的(401A3Fh)
00401A69 MOV EDX,EAX
00401A6B MOV EDI,ECX
00401A6D SHRD EDX,EDI,8
00401A71 MOV ESI,EDX
00401A73 SHR EDI,8
00401A76和ESI,EAX
00401A78和EDI,ECX
00401A7A MOV EDX,ESI
00401A7C MOV EBX,EDI
00401A7E SHRD EDX,EBX,10H
00401A82 SHR EBX,10H
00401A85和EDX,ESI
00401A87和EBX,EDI
00401A89或EDX,EBX
00401A8B JNE`匿名命名空间:: haswon + 2Fh的(401A3Fh)
00401A8D MOV EDX,EAX
00401A8F MOV ESI,ECX
00401A91 SHRD EDX,ESI,1
00401A95 SHR ESI,1
00401A97和ESI,ECX
00401A99和EDX,EAX
00401A9B MOV EAX,EDX
00401A9D MOV ECX,ESI
00401A9F SHRD EAX,ECX,2
00401AA3 SHR ECX,2
00401AA6和EAX,EDX
00401AA8和ECX,ESI
00401AAA或EAX,ECX
00401AAC JNE`匿名命名空间:: haswon + 2Fh的(401A3Fh)
00401AAE流行EDI
00401AAF流行ESI
00401AB0异人,人
00401AB2流行EBX
00401AB3 RET
 

解决方案

这背后版本的想法是为了避免严格的检测顺序(中间回报强制编译器来评估条件之一的时间,按顺序),以及作为分支与多个相关联的if语句:

  //返回newboard是否包括一个双赢
布尔haswon2(uint64_t中newboard)
{
    uint64_t中Y = newboard和放大器; (newboard>&→6);
    uint64_t中Z = newboard和放大器; (newboard>大于7);
    uint64_t中W = newboard和放大器; (newboard>→8);
    uint64_t中X = newboard和放大器; (newboard>大于1);
    返回(γ及(Y'GT;→2 * 6))| //检查\对角线
           (Z及(Z取代;→2 * 7))| //检查水平 - 
           (瓦特及(并且R w→2 * 8))| //检查/对角线
           (X及(X  -  GT;&→2)); //检查垂直|
}
 

通过优化体面的水平,你真的可以想到的W,X,Y和Z为别名,为位移后的值。这意味着最后一个return语句引发整个操作成一个大汤编译器一起玩。在我的系统这个版本只需要65%的原始(包括生成每次随机位置的开销)的运行时间。它可以通过一个较大的比例胜出,如果板是主要非中标。

综观各拆装(从 GCC -O3 )的原始版本其实是较短,所以很可能缺乏分支在紧凑的内部循环,真正帮助的

Do you think there is room for optimizations in the function haswon (see below)?

I recognized that changing the argument type from __int64 to unsigned __int64 made the function faster, thus i thougt maybe there is still a chance for optimization.

In more detail: I am writing a connect four game. Recently i used the Profiler Very Sleepy and recognized that the function haswon uses much of the cpu-time. The function uses a bitboard-representation of the connect-four-board for one player. The function itself i found in the sources of the fourstones benchmark. The bitboard representation is following:

.  .  .  .  .  .  .  TOP
5 12 19 26 33 40 47
4 11 18 25 32 39 46
3 10 17 24 31 38 45
2  9 16 23 30 37 44
1  8 15 22 29 36 43
0  7 14 21 28 35 42  BOTTOM

The function:

// return whether newboard includes a win
bool haswon(unsigned __int64 newboard)
{
    unsigned __int64 y = newboard & (newboard >> 6);
    if (y & (y >> 2 * 6)) // check \ diagonal
        return true;
    y = newboard & (newboard >> 7);
    if (y & (y >> 2 * 7)) // check horizontal -
        return true;
    y = newboard & (newboard >> 8);
    if (y & (y >> 2 * 8)) // check / diagonal
        return true;
    y = newboard & (newboard >> 1);
    if (y & (y >> 2))     // check vertical |
        return true;
    return false;
}

Thanks!

Edit: CPU is x86, 32 Bit Architecture, i'm using the Compiler from the Visual Studio 2008 Express Edition. Optimization Flags are /O2 /Oi /GL.

I tried the function haswon2 which Ben Jackson suggested. The assemblies from the Microsoft Compiler, with the default optimization flags for release versions (/O2 /Oi /GL), showing nearly no runtime differences. It looks like that the VC-Compiler in comparison to gcc can not take advantage that it must not evaluate each condition in strict order.

Results: haswon original:

haswon2 from Ben Jackson:

Edit2: Assembly of haswon:

00401A10  mov         eax,dword ptr [esp+4] 
00401A14  mov         ecx,dword ptr [esp+8] 
00401A18  push        ebx  
00401A19  push        esi  
00401A1A  push        edi  
00401A1B  mov         edx,eax 
00401A1D  mov         edi,ecx 
00401A1F  shrd        edx,edi,6 
00401A23  mov         esi,edx 
00401A25  shr         edi,6 
00401A28  and         esi,eax 
00401A2A  and         edi,ecx 
00401A2C  mov         edx,esi 
00401A2E  mov         ebx,edi 
00401A30  shrd        edx,ebx,0Ch 
00401A34  shr         ebx,0Ch 
00401A37  and         edx,esi 
00401A39  and         ebx,edi 
00401A3B  or          edx,ebx 
00401A3D  je          `anonymous namespace'::haswon+35h (401A45h) 
00401A3F  mov         al,1 
00401A41  pop         edi  
00401A42  pop         esi  
00401A43  pop         ebx  
00401A44  ret              
00401A45  mov         edx,eax 
00401A47  mov         edi,ecx 
00401A49  shrd        edx,edi,7 
00401A4D  mov         esi,edx 
00401A4F  shr         edi,7 
00401A52  and         esi,eax 
00401A54  and         edi,ecx 
00401A56  mov         edx,esi 
00401A58  mov         ebx,edi 
00401A5A  shrd        edx,ebx,0Eh 
00401A5E  shr         ebx,0Eh 
00401A61  and         edx,esi 
00401A63  and         ebx,edi 
00401A65  or          edx,ebx 
00401A67  jne         `anonymous namespace'::haswon+2Fh (401A3Fh) 
00401A69  mov         edx,eax 
00401A6B  mov         edi,ecx 
00401A6D  shrd        edx,edi,8 
00401A71  mov         esi,edx 
00401A73  shr         edi,8 
00401A76  and         esi,eax 
00401A78  and         edi,ecx 
00401A7A  mov         edx,esi 
00401A7C  mov         ebx,edi 
00401A7E  shrd        edx,ebx,10h 
00401A82  shr         ebx,10h 
00401A85  and         edx,esi 
00401A87  and         ebx,edi 
00401A89  or          edx,ebx 
00401A8B  jne         `anonymous namespace'::haswon+2Fh (401A3Fh) 
00401A8D  mov         edx,eax 
00401A8F  mov         esi,ecx 
00401A91  shrd        edx,esi,1 
00401A95  shr         esi,1 
00401A97  and         esi,ecx 
00401A99  and         edx,eax 
00401A9B  mov         eax,edx 
00401A9D  mov         ecx,esi 
00401A9F  shrd        eax,ecx,2 
00401AA3  shr         ecx,2 
00401AA6  and         eax,edx 
00401AA8  and         ecx,esi 
00401AAA  or          eax,ecx 
00401AAC  jne         `anonymous namespace'::haswon+2Fh (401A3Fh) 
00401AAE  pop         edi  
00401AAF  pop         esi  
00401AB0  xor         al,al 
00401AB2  pop         ebx  
00401AB3  ret    

解决方案

The idea behind this version is to avoid the strict testing order (the intermediate returns force the compiler to evaluate the conditions one at a time, in order) as well as the branching associated with multiple if statements:

// return whether newboard includes a win
bool haswon2(uint64_t newboard)
{
    uint64_t y = newboard & (newboard >> 6);
    uint64_t z = newboard & (newboard >> 7);
    uint64_t w = newboard & (newboard >> 8);
    uint64_t x = newboard & (newboard >> 1);
    return (y & (y >> 2 * 6)) | // check \ diagonal
           (z & (z >> 2 * 7)) | // check horizontal -
           (w & (w >> 2 * 8)) | // check / diagonal
           (x & (x >> 2));      // check vertical |
}

With a decent level of optimization you can really think of w, x, y and z as "aliases" for the shifted values. This means the final return statement throws the entire operation into a big soup for the compiler to play with. On my system this version takes only 65% of the runtime of the original (including the overhead of generating a random position each time). It might win by a larger percentage if the boards are mainly non-winning.

Looking at the disassembly of each (from gcc -O3) the original version is actually shorter, so it's likely the lack of branching in the tight inner loop that really helps.

这篇关于优化的机会,以下位操作?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆