How does loop address alignment affect the speed on Intel x86_64?
Question
I'm seeing a 15% performance degradation in the same C++ code, compiled to exactly the same machine instructions but located at differently aligned addresses. When my tiny main loop starts at 0x415220 it is faster than when it starts at 0x415250. I'm running this on an Intel Core2 Duo, using gcc 4.4.5 on x86_64 Ubuntu.
Can anybody explain the cause of the slowdown, and how can I force gcc to align the loop optimally?
Here is the disassembly for both cases with profiler annotation (the first listing is the faster loop at 0x415220, the second is the slower loop at 0x415250):
415220 576 12.56% |XXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx
415224 110 2.40% |XX 0f b6 c3 movzbl %bl,%eax
415227 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax
41522c 40 0.87% | 48 8b 04 c1 mov (%rcx,%rax,8),%rax
415230 806 17.58% |XXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15
415233 186 4.06% |XXXX 48 c1 e8 20 shr $0x20,%rax
415237 102 2.22% |XX 4c 01 f9 add %r15,%rcx
41523a 414 9.03% |XXXXXXXXXX a8 0f test $0xf,%al
41523c 680 14.83% |XXXXXXXXXXXXXXXX 74 45 je 415283 ::Run(char const*, char const*)+0x4b3>
41523e 0.00% | 41 89 c7 mov %eax,%r15d
415241 0.00% | 41 83 e7 01 and $0x1,%r15d
415245 0.00% | 41 83 ff 01 cmp $0x1,%r15d
415249 0.00% | 41 89 c7 mov %eax,%r15d
415250 679 13.05% |XXXXXXXXXXXXXXXX 48 c1 eb 08 shr $0x8,%rbx
415254 124 2.38% |XX 0f b6 c3 movzbl %bl,%eax
415257 0.00% | 41 0f b6 04 00 movzbl (%r8,%rax,1),%eax
41525c 43 0.83% |X 48 8b 04 c1 mov (%rcx,%rax,8),%rax
415260 828 15.91% |XXXXXXXXXXXXXXXXXXX 4c 63 f8 movslq %eax,%r15
415263 388 7.46% |XXXXXXXXX 48 c1 e8 20 shr $0x20,%rax
415267 141 2.71% |XXX 4c 01 f9 add %r15,%rcx
41526a 634 12.18% |XXXXXXXXXXXXXXX a8 0f test $0xf,%al
41526c 749 14.39% |XXXXXXXXXXXXXXXXXX 74 45 je 4152b3 ::Run(char const*, char const*)+0x4c3>
41526e 0.00% | 41 89 c7 mov %eax,%r15d
415271 0.00% | 41 83 e7 01 and $0x1,%r15d
415275 0.00% | 41 83 ff 01 cmp $0x1,%r15d
415279 0.00% | 41 89 c7 mov %eax,%r15d
Recommended answer
GCC has a -falign-loops=n option, which aligns loops to a power-of-two boundary, where n is the maximum number of bytes to skip to reach it. If n is omitted, a machine-dependent default is used. GCC enables this option automatically at the -O2 and -O3 optimization levels.
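As a rough illustration, the sketch below checks how the two loop addresses from the question sit relative to 16- and 32-byte boundaries, and shows in comments how the flag might be passed on the command line. The helper offset_in_block, the function sum_bytes, and the file name align_demo.cpp are hypothetical names introduced only for this example, and the choice of 32 is an assumption suggested by the addresses, not a confirmed fix for the asker's code.

```cpp
#include <cstddef>
#include <cstdint>

// Offset of an address inside a power-of-two sized block.
// 'boundary' is assumed to be a power of two (16, 32, 64, ...).
constexpr std::uint64_t offset_in_block(std::uint64_t addr,
                                        std::uint64_t boundary) {
    return addr & (boundary - 1);
}

// Loop start addresses quoted in the question.
constexpr std::uint64_t kFastLoop = 0x415220;
constexpr std::uint64_t kSlowLoop = 0x415250;

// Both loops start on a 16-byte boundary...
static_assert(offset_in_block(kFastLoop, 16) == 0, "fast loop is 16-byte aligned");
static_assert(offset_in_block(kSlowLoop, 16) == 0, "slow loop is 16-byte aligned");
// ...but only the faster one also starts on a 32-byte boundary.
static_assert(offset_in_block(kFastLoop, 32) == 0, "fast loop is 32-byte aligned");
static_assert(offset_in_block(kSlowLoop, 32) == 16, "slow loop is 16 bytes past a 32-byte boundary");

// Hypothetical hot loop, standing in for the asker's real code.
// One way to experiment with its alignment:
//   g++ -O2 -falign-loops=32 -c align_demo.cpp
//   objdump -d align_demo.o    # inspect where the loop body starts
std::uint64_t sum_bytes(const unsigned char* data, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Whether a larger alignment actually recovers the lost 15% has to be measured on the target CPU; the arithmetic only shows that the two placements differ at the 32-byte level.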