Code alignment in one object file is affecting the performance of a function in another object file


Question


I'm familiar with data alignment and performance but I'm rather new to aligning code. I started programming in x86-64 assembly recently with NASM and have been comparing performance using code alignment. As far as I can tell NASM inserts nop instructions to achieve code alignment.

Here is a function I have been trying this with on an Ivy Bridge system:

void triad(float *x, float *y, float *z, int n, int repeat) {
    float k = 3.14159f;
    for(int r=0; r<repeat; r++) {
        for(int i=0; i<n; i++) {
            z[i] = x[i] + k*y[i];
        }
    }
}

The assembly I'm using for this is below. If I don't specify the alignment, my performance compared to the peak is only about 90%. However, when I align the code before the loop, as well as both inner loops, to 16 bytes, the performance jumps to 96%. So clearly code alignment makes a difference in this case.

But here is the strangest part. If I align the innermost loop to 32 bytes it makes no difference to the performance of this function. However, in another version of this function, which uses intrinsics and lives in a separate object file that I link in, the performance jumps from 90% to 95%!

I did an object dump (using objdump -d -M intel) of the versions aligned to 16 bytes (I posted the result at the end of this question) and to 32 bytes, and they are identical! It turns out that the innermost loop is aligned to 32 bytes anyway in both object files. But there must be some difference.

I did a hex dump of each object file and there is one byte in the object files that differs. The object file aligned to 16 bytes has a byte with 0x10 and the object file aligned to 32 bytes has a byte with 0x20. What exactly is going on? Why does code alignment in one object file affect the performance of a function in another object file? How do I know the optimal value to align my code to?

My only guess is that when the code is relocated by the loader, the 32-byte-aligned object file affects the alignment of the other object file, the one using intrinsics. You can find the code to test all this at Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%
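
For reference, this is roughly how I can reproduce the one-byte difference (the source file names here are just placeholders for the two builds; nasm -f elf64 and the standard binutils tools are all that is needed):

nasm -f elf64 -o test16.o triad16.asm   # align 16 version
nasm -f elf64 -o test32.o triad32.asm   # align 32 version
cmp -l test16.o test32.o                # prints the offset and the two values of the differing byte
readelf -S test16.o                     # section headers; the "Al" column shows each section's alignment requirement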

The NASM code I am using:

global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

Result from objdump -d -M intel test16.o. The disassembly is identical if I change align 16 to align 32 in the assembly above just before .L2. However, the object files still differ by one byte.

test16.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <pi>:
   0:   d0 0f                   ror    BYTE PTR [rdi],1
   2:   49                      rex.WB
   3:   40 90                   rex xchg eax,eax
   5:   90                      nop
   6:   90                      nop
   7:   90                      nop
   8:   90                      nop
   9:   90                      nop
   a:   90                      nop
   b:   90                      nop
   c:   90                      nop
   d:   90                      nop
   e:   90                      nop
   f:   90                      nop

0000000000000010 <triad_avx_asm_repeat>:
  10:   48 c1 e1 02             shl    rcx,0x2
  14:   48 01 cf                add    rdi,rcx
  17:   48 01 ce                add    rsi,rcx
  1a:   48 01 ca                add    rdx,rcx
  1d:   c4 e2 7d 18 15 da ff    vbroadcastss ymm2,DWORD PTR [rip+0xffffffffffffffda]        # 0 <pi>
  24:   ff ff 
  26:   90                      nop
  27:   90                      nop
  28:   90                      nop
  29:   90                      nop
  2a:   90                      nop
  2b:   90                      nop
  2c:   90                      nop
  2d:   90                      nop
  2e:   90                      nop
  2f:   90                      nop

0000000000000030 <triad_avx_asm_repeat.L1>:
  30:   48 89 c8                mov    rax,rcx
  33:   48 f7 d8                neg    rax
  36:   90                      nop
  37:   90                      nop
  38:   90                      nop
  39:   90                      nop
  3a:   90                      nop
  3b:   90                      nop
  3c:   90                      nop
  3d:   90                      nop
  3e:   90                      nop
  3f:   90                      nop

0000000000000040 <triad_avx_asm_repeat.L2>:
  40:   c5 ec 59 0c 07          vmulps ymm1,ymm2,YMMWORD PTR [rdi+rax*1]
  45:   c5 f4 58 0c 06          vaddps ymm1,ymm1,YMMWORD PTR [rsi+rax*1]
  4a:   c5 fc 29 0c 02          vmovaps YMMWORD PTR [rdx+rax*1],ymm1
  4f:   48 83 c0 20             add    rax,0x20
  53:   75 eb                   jne    40 <triad_avx_asm_repeat.L2>
  55:   41 83 e8 01             sub    r8d,0x1
  59:   75 d5                   jne    30 <triad_avx_asm_repeat.L1>
  5b:   c5 f8 77                vzeroupper 
  5e:   c3                      ret    
  5f:   90                      nop

Solution

Ahhh, code alignment...

Some basics of code alignment..

  • Most intel architectures fetch 16B worth of instructions per clock.
  • The branch predictor has a larger window and looks at typically double that, per clock. The idea is to get ahead of the instructions fetched.
  • How your code is aligned will dictate which instructions you have available to decode and predict at any given clock (simple code locality argument).
  • Most modern intel architectures cache instructions at various levels (either at the macro-instruction level before decoding, or at the micro-instruction level after decoding). This eliminates the effects of code alignment, as long as you're executing out of the micro/macro cache.
  • Also, most modern intel architectures have some form of loop stream detector that detects loops, again, executing them out of some cache that bypasses the front end fetch mechanism.
  • Some intel architectures are finicky about what they can cache, and what they can't. There are often dependencies on number of instructions/uops/alignment/branches/etc. Alignment may, in some cases, affect what's cached and what's not, and you can create cases where padding can prevent or cause a loop to get cached.
  • To make things even more complicated, the addresses of instructions are also used by the branch predictor. They are used in several ways, including (1) as a lookup into a branch prediction buffer to predict branches, (2) as a key/value to maintain some form of global state of branch behavior for prediction purposes, (3) as a key into determining indirect branch targets, etc.. Therefore, alignment can actually have a pretty huge impact on branch prediction, in some cases, due to aliasing or other poor prediction.
  • Some architectures use instruction addresses to determine when to prefetch data, and code alignment can interfere with that, if just the right conditions exist.
  • Aligning loops is not always a good thing to do, depending on how the code is laid out (especially if there's control flow in the loop); see the sketch after this list for where alignment padding actually lands.
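
To make that last point concrete, here is a minimal NASM sketch (a made-up qword-summing loop, not code from the question) showing where alignment padding lands: the nops emitted by align sit before the loop label, so they execute once on entry rather than on every iteration. Padding placed after the label, inside the body, would run on every pass.

global sum_qwords
; RDI = pointer to qwords, RSI = count; sum returned in RAX
section .text
sum_qwords:
    xor     eax, eax        ; clear the accumulator (zero-extends to RAX)
    test    rsi, rsi
    jz      .done           ; nothing to do for count == 0
align 16                    ; nops land here, outside the loop body
.loop:
    add     rax, [rdi]      ; accumulate one 64-bit element
    add     rdi, 8
    dec     rsi
    jnz     .loop
.done:
    ret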

Having said all that blah blah, your issue could be any one of these. It's important to look at the disassembly of not just the object, but the executable. You want to see what the final addresses are after everything is linked. Making changes in one object could affect the alignment/addresses of instructions in another object after linking.
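
For example (the executable name here is just a placeholder), the final addresses and the per-object alignment constraints can be inspected with:

objdump -d -M intel ./test       # disassembly of the linked executable: the addresses that actually matter
readelf -S test16.o              # "Al" is sh_addralign, the alignment the linker must honor when placing the section

In your case, a plausible suspect for the single differing byte is the .text section's sh_addralign field (0x10 vs 0x20): align 32 would raise the section's alignment requirement to 32 bytes, and the linker would then place that section, and potentially everything laid out after it, on a 32B boundary, shifting the addresses of code in your other objects.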

In some cases, it's near impossible to align your code in such a way as to maximize performance, simply due to so many low-level architectural behaviors being hard to control and predict (that doesn't necessarily mean this is always the case). In some cases, your best bet is to have some default alignment strategy (say, align all entries on 16B boundaries, and outer loops the same) so that you minimize the amount your performance varies from change to change. As a general strategy, aligning function entries is good. Aligning loops that are relatively small is good, as long as you're not adding nops in your execution path.
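
As a sketch of what such a default policy can look like in NASM (the FUNC macro below is a hypothetical convenience, not anything standard):

%macro FUNC 1               ; hypothetical: uniform 16B-aligned entry for every function
align 16                    ; pad so the entry starts on a 16B boundary
global %1
%1:
%endmacro

section .text
FUNC return_zero
    xor     eax, eax
    ret

FUNC return_one
    mov     eax, 1
    ret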

Beyond that, I'd need more info/data to pinpoint your exact problem, but thought some of this may help.. Good luck :)
