跟随位操作的优化机会? [英] Optimization chance for following Bit-Operations?
问题描述
您认为函数haswon(见下文)有优化空间吗?
我认识到将参数类型从 __int64
更改为 unsigned __int64
可以使函数更快,因此我认为可能还有优化的机会.
更详细:我正在编写一个连接四个游戏.最近我使用了 Profiler Very Sleepy 并认识到该函数 haswon 使用了大部分 cpu 时间.该函数使用一位玩家的连接四板的位板表示.我在 fourstones 基准的来源中找到了该函数本身.位板表示如下:
<预><代码>.......最佳5 12 19 26 33 40 474 11 18 25 32 39 463 10 17 24 31 38 452 9 16 23 30 37 441 8 15 22 29 36 430 7 14 21 28 35 42 底部功能:
//返回新板是否包含胜利bool haswon(unsigned __int64 newboard){无符号 __int64 y = newboard &(新板>> 6);if (y & (y >> 2 * 6))//检查对角线返回真;y = 新板 &(新板>> 7);if (y & (y >> 2 * 7))//检查水平 -返回真;y = 新板 &(新板>>8);if (y & (y >> 2 * 8))//检查/对角线返回真;y = 新板 &(新板>> 1);if (y & (y >> 2))//检查垂直 |返回真;返回假;}
谢谢!
CPU 是 x86、32 位架构,我使用的是 Visual Studio 2008 Express Edition 的编译器.优化标志为/O2/Oi/GL.
我尝试了 Ben Jackson 建议的函数 haswon2.来自 Microsoft Compiler 的程序集,具有发布版本 (/O2/Oi/GL) 的默认优化标志,几乎没有显示运行时差异.与 gcc 相比,VC-Compiler 似乎无法利用它不能以严格的顺序评估每个条件.
结果:haswon 原文:
来自本杰克逊的haswon2:
编辑 2:haswon 的组装:
00401A10 mov eax,dword ptr [esp+4]00401A14 mov ecx,dword ptr [esp+8]00401A18 推ebx00401A19 推esi00401A1A推edi00401A1B mov edx,eax00401A1D mov edi,ecx00401A1F 碎片 edx,edi,600401A23 mov esi,edx00401A25 shr edi,600401A28 和 esi,eax00401A2A 和 edi,ecx00401A2C mov edx,esi00401A2E mov ebx,edi00401A30 碎片 edx,ebx,0Ch00401A34 shr ebx,0Ch00401A37 和 edx,esi00401A39 和 ebx,edi00401A3B 或 edx,ebx00401A3D je `匿名命名空间'::haswon+35h (401A45h)00401A3F mov al,100401A41 流行音乐00401A42 流行音乐00401A43 流行 ebx00401A44 ret00401A45 mov edx,eax00401A47 mov edi,ecx00401A49 碎片 edx,edi,700401A4D mov esi,edx00401A4F shr edi,700401A52 和 esi,eax00401A54 和 edi,ecx00401A56 mov edx,esi00401A58 mov ebx,edi00401A5A 碎片 edx,ebx,0Eh00401A5E shr ebx,0Eh00401A61 和 edx,esi00401A63 和 ebx,edi00401A65 或 edx,ebx00401A67 jne `匿名命名空间'::haswon+2Fh (401A3Fh)00401A69 mov edx,eax00401A6B mov edi,ecx00401A6D 碎片 edx,edi,800401A71 mov esi,edx00401A73 shr edi,800401A76 和 esi,eax00401A78 和 edi,ecx00401A7A mov edx,esi00401A7C mov ebx,edi00401A7E 碎片 edx,ebx,10h00401A82 shr ebx,10h00401A85 和 edx,esi00401A87 和 ebx,edi00401A89 或 edx,ebx00401A8B jne `匿名命名空间'::haswon+2Fh (401A3Fh)00401A8D mov edx,eax00401A8F mov esi,ecx00401A91 碎片 edx,esi,100401A95 shr esi,100401A97 和 esi,ecx00401A99 和 edx,eax00401A9B mov eax,edx00401A9D mov ecx,esi00401A9F 碎片 eax,ecx,200401AA3 shr ecx,200401AA6 和 eax,edx00401AA8 和 ecx,esi00401AAA 或 eax,ecx00401AAC jne `匿名命名空间'::haswon+2Fh (401A3Fh)00401AAE 流行音乐00401AAF 流行音乐00401AB0 xor al,al00401AB2 流行 ebx00401AB3 ret
这个版本背后的想法是避免严格的测试顺序(中间返回强制编译器一次一个地评估条件,按顺序)作为与多个 if 语句关联的分支:
//返回新板是否包含胜利bool haswon2(uint64_t newboard){uint64_t y = 新板 &(新板>> 6);uint64_t z = 新板 &(新板>> 7);uint64_t w = 新板 &(新板>>8);uint64_t x = 新板 &(新板>> 1);返回 (y & (y >> 2 * 6)) |//检查对角线(z & (z >> 2 * 7)) |//检查水平 -(w & (w >> 2 * 8)) |//检查/对角线(x & (x >> 2));//检查垂直 |}
通过适当的优化水平,您可以真正将 w、x、y 和 z 视为移位值的别名".这意味着最终的 return 语句将整个操作扔进一大堆供编译器使用的汤中.在我的系统上,此版本仅占用原始运行时间的 65%(包括每次生成随机位置的开销).如果棋盘主要是不赢的,它可能会以更大的比例获胜.
查看每个(来自 gcc -O3
)的反汇编,原始版本实际上更短,因此可能是在紧密的内部循环中缺少分支确实有帮助.
Do you think there is room for optimizations in the function haswon (see below)?
I recognized that changing the argument type from __int64
to unsigned __int64
made the function faster, thus i thougt maybe there is still a chance for optimization.
In more detail: I am writing a connect four game. Recently i used the Profiler Very Sleepy and recognized that the function haswon uses much of the cpu-time. The function uses a bitboard-representation of the connect-four-board for one player. The function itself i found in the sources of the fourstones benchmark. The bitboard representation is following:
. . . . . . . TOP
5 12 19 26 33 40 47
4 11 18 25 32 39 46
3 10 17 24 31 38 45
2 9 16 23 30 37 44
1 8 15 22 29 36 43
0 7 14 21 28 35 42 BOTTOM
The function:
// return whether newboard includes a win
bool haswon(unsigned __int64 newboard)
{
unsigned __int64 y = newboard & (newboard >> 6);
if (y & (y >> 2 * 6)) // check diagonal
return true;
y = newboard & (newboard >> 7);
if (y & (y >> 2 * 7)) // check horizontal -
return true;
y = newboard & (newboard >> 8);
if (y & (y >> 2 * 8)) // check / diagonal
return true;
y = newboard & (newboard >> 1);
if (y & (y >> 2)) // check vertical |
return true;
return false;
}
Thanks!
Edit: CPU is x86, 32 Bit Architecture, i'm using the Compiler from the Visual Studio 2008 Express Edition. Optimization Flags are /O2 /Oi /GL.
I tried the function haswon2 which Ben Jackson suggested. The assemblies from the Microsoft Compiler, with the default optimization flags for release versions (/O2 /Oi /GL), showing nearly no runtime differences. It looks like that the VC-Compiler in comparison to gcc can not take advantage that it must not evaluate each condition in strict order.
Results: haswon original:
haswon2 from Ben Jackson:
Edit2: Assembly of haswon:
00401A10 mov eax,dword ptr [esp+4]
00401A14 mov ecx,dword ptr [esp+8]
00401A18 push ebx
00401A19 push esi
00401A1A push edi
00401A1B mov edx,eax
00401A1D mov edi,ecx
00401A1F shrd edx,edi,6
00401A23 mov esi,edx
00401A25 shr edi,6
00401A28 and esi,eax
00401A2A and edi,ecx
00401A2C mov edx,esi
00401A2E mov ebx,edi
00401A30 shrd edx,ebx,0Ch
00401A34 shr ebx,0Ch
00401A37 and edx,esi
00401A39 and ebx,edi
00401A3B or edx,ebx
00401A3D je `anonymous namespace'::haswon+35h (401A45h)
00401A3F mov al,1
00401A41 pop edi
00401A42 pop esi
00401A43 pop ebx
00401A44 ret
00401A45 mov edx,eax
00401A47 mov edi,ecx
00401A49 shrd edx,edi,7
00401A4D mov esi,edx
00401A4F shr edi,7
00401A52 and esi,eax
00401A54 and edi,ecx
00401A56 mov edx,esi
00401A58 mov ebx,edi
00401A5A shrd edx,ebx,0Eh
00401A5E shr ebx,0Eh
00401A61 and edx,esi
00401A63 and ebx,edi
00401A65 or edx,ebx
00401A67 jne `anonymous namespace'::haswon+2Fh (401A3Fh)
00401A69 mov edx,eax
00401A6B mov edi,ecx
00401A6D shrd edx,edi,8
00401A71 mov esi,edx
00401A73 shr edi,8
00401A76 and esi,eax
00401A78 and edi,ecx
00401A7A mov edx,esi
00401A7C mov ebx,edi
00401A7E shrd edx,ebx,10h
00401A82 shr ebx,10h
00401A85 and edx,esi
00401A87 and ebx,edi
00401A89 or edx,ebx
00401A8B jne `anonymous namespace'::haswon+2Fh (401A3Fh)
00401A8D mov edx,eax
00401A8F mov esi,ecx
00401A91 shrd edx,esi,1
00401A95 shr esi,1
00401A97 and esi,ecx
00401A99 and edx,eax
00401A9B mov eax,edx
00401A9D mov ecx,esi
00401A9F shrd eax,ecx,2
00401AA3 shr ecx,2
00401AA6 and eax,edx
00401AA8 and ecx,esi
00401AAA or eax,ecx
00401AAC jne `anonymous namespace'::haswon+2Fh (401A3Fh)
00401AAE pop edi
00401AAF pop esi
00401AB0 xor al,al
00401AB2 pop ebx
00401AB3 ret
The idea behind this version is to avoid the strict testing order (the intermediate returns force the compiler to evaluate the conditions one at a time, in order) as well as the branching associated with multiple if statements:
// return whether newboard includes a win
bool haswon2(uint64_t newboard)
{
uint64_t y = newboard & (newboard >> 6);
uint64_t z = newboard & (newboard >> 7);
uint64_t w = newboard & (newboard >> 8);
uint64_t x = newboard & (newboard >> 1);
return (y & (y >> 2 * 6)) | // check diagonal
(z & (z >> 2 * 7)) | // check horizontal -
(w & (w >> 2 * 8)) | // check / diagonal
(x & (x >> 2)); // check vertical |
}
With a decent level of optimization you can really think of w, x, y and z as "aliases" for the shifted values. This means the final return statement throws the entire operation into a big soup for the compiler to play with. On my system this version takes only 65% of the runtime of the original (including the overhead of generating a random position each time). It might win by a larger percentage if the boards are mainly non-winning.
Looking at the disassembly of each (from gcc -O3
) the original version is actually shorter, so it's likely the lack of branching in the tight inner loop that really helps.
这篇关于跟随位操作的优化机会?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!