The effect of code alignment in timing main loops in assembly

Question

Suppose I have the following main loop:

.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2

The way I would time this is to put it inside another long loop, like this:

;align 32              
.L1:
    mov             rax, rcx
    neg             rax
align 32
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1                 ; r8 contains a large integer
    jnz             .L1

What I'm finding is that the alignment I choose can have a significant effect on the timing (up to ±10%). It's not clear to me how to choose the code alignment. There are three places I can think of where I might want to align the code:

  1. At the entry to the function (see e.g. triad_fma_asm_repeat in the code below).
  2. At the start of the outer loop (.L1 above), which repeats my main loop.
  3. At the start of my main loop (.L2 above).

Another thing I have found is that if I put another routine in my source file, changing one instruction (e.g. removing an instruction) can have a significant effect on the timing of the next function, even when they are independent functions. I have even seen this in the past affect a routine in another object file.

I have read section 11.5, "Alignment of code", in Agner Fog's optimizing assembly manual, but it's still not clear to me what the best way is to align my code for testing performance. He gives an example (11.5) of timing an inner loop, which I don't really follow.

Currently, getting the highest performance from my code is a game of guessing different alignment values and locations.

I would like to know if there is an intelligent method to choose the alignment. Should I align the inner and outer loops? Just the inner loop? The entry to the function as well? Does using short or long NOPs matter?

I'm mostly interested in Haswell, followed by SNB/IVB, and then Core2.

I have tried both NASM and YASM and have discovered that this is one area where they differ significantly. NASM only inserts one-byte NOP instructions, whereas YASM inserts multi-byte NOPs. For example, when aligning both the inner and outer loops above to 32 bytes, NASM inserted 20 NOP (0x90) instructions, whereas YASM inserted the following (from objdump):

  2c:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 nopw  %cs:0x0(%rax,%rax,1)
  33:   0f 1f 84 00 00 00 00 
  3a:   00 
  3b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
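
The padding count itself can be checked by hand. The following is a minimal sketch in C, assuming a power-of-two alignment; the helper name padding_to_align is hypothetical and is not part of NASM or YASM.

```c
#include <assert.h>
#include <stddef.h>

/* Number of padding bytes needed so that the next instruction starts on an
   `align`-byte boundary. `offset` is the current offset within the section;
   `align` must be a power of two (assumption of this sketch). */
size_t padding_to_align(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (offset & (align - 1))) & (align - 1);
}
```

For the listing above, padding_to_align(0x2c, 32) gives 20, which matches both the 20 single-byte NOPs from NASM and the 15-byte plus 5-byte NOPs from YASM.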

So far I have not observed a significant difference in performance between the two. It appears that it is the alignment that matters, not the length of the padding instructions. But Agner writes in the aligning code section:

It is more efficient to use longer instructions that do nothing than to use a lot of single-byte NOP's.


If you want to play with the alignment and see the effects yourself, below you can find both the assembly and C code I use. Replace double frequency = 3.6 with the effective frequency of your CPU. You may want to disable turbo.

;nasm/yasm -f elf64 align_asm.asm
global triad_fma_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = y[i] + 3.14159*x[i]
pi: dd 3.14159

section .text
align 16
triad_fma_asm_repeat:

    shl             rcx, 2
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx

;align 32
.L1:
    mov             rax, rcx
    neg             rax
align 32
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

global triad_fma_store_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
;z[i] = y[i] + 3.14159*x[i]

align 16
triad_fma_store_asm_repeat:
    shl             rcx, 2
    add             rcx, rdx
    sub             rdi, rdx
    sub             rsi, rdx
    vbroadcastss    ymm2, [rel pi]

;align 32
.L1:
    mov             r9, rdx
align 32
.L2:
    vmulps          ymm1, ymm2, [rdi+r9]
    vaddps          ymm1, ymm1, [rsi+r9]
    vmovaps         [r9], ymm1
    add             r9, 32
    cmp             r9, rcx
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

Here is the C code I use to call the assembly routines and time them

//gcc -std=gnu99 -O3        -mavx align.c -lgomp align_asm.o -o align_avx
//gcc -std=gnu99 -O3 -mfma -mavx2 align.c -lgomp align_asm.o -o align_fma
#include <stdio.h>
#include <string.h>
#include <immintrin.h>  // __m256, _mm256_* intrinsics, _mm_malloc
#include <omp.h>

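// The assembly routines return nothing useful, so they are declared void.
void triad_fma_asm_repeat(float *x, float *y, float *z, const int n, int repeat);
void triad_fma_store_asm_repeat(float *x, float *y, float *z, const int n, int repeat);

// Intrinsics reference version, used to produce the expected output z2.
void triad_fma_repeat(float *x, float *y, float *z, const int n, int repeat)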
{
    float k = 3.14159f;
    int r;
    for(r=0; r<repeat; r++) {
        int i;
        __m256 k4 = _mm256_set1_ps(k);
        for(i=0; i<n; i+=8) {
            _mm256_store_ps(&z[i], _mm256_add_ps(_mm256_load_ps(&x[i]), _mm256_mul_ps(k4, _mm256_load_ps(&y[i]))));
        }
    }
}

int main (void )
{
    int bytes_per_cycle = 0;
    double frequency = 3.6;
    #if (defined(__FMA__))
    bytes_per_cycle = 96;
    #elif (defined(__AVX__))
    bytes_per_cycle = 48;
    #else
    bytes_per_cycle = 24;
    #endif
    double peak = frequency*bytes_per_cycle;

    const int n =2048;

    float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
    char *mem = (char*)_mm_malloc(1<<18,4096);
    char *a = mem;
    char *b = a+n*sizeof(float);
    char *c = b+n*sizeof(float);

    float *x = (float*)a;
    float *y = (float*)b;
    float *z = (float*)c;

    for(int i=0; i<n; i++) {
        x[i] = 1.0f*i;
        y[i] = 1.0f*i;
        z[i] = 0;
    }
    int repeat = 1000000;    
    triad_fma_repeat(x,y,z2,n,repeat);   

    while(1) {
        double dtime, rate;

        memset(z, 0, n*sizeof(float));
        dtime = -omp_get_wtime();
        triad_fma_asm_repeat(x,y,z,n,repeat);
        dtime += omp_get_wtime();
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("t1     rate %6.2f GB/s, efficiency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));

        memset(z, 0, n*sizeof(float));
        dtime = -omp_get_wtime();
        triad_fma_store_asm_repeat(x,y,z,n,repeat);
        dtime += omp_get_wtime();
        rate = 3.0*1E-9*sizeof(float)*n*repeat/dtime;
        printf("t2     rate %6.2f GB/s, efficiency %6.2f%%, error %d\n", rate, 100*rate/peak, memcmp(z,z2, sizeof(float)*n));

        puts("");
    }
}
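
The rate and efficiency figures printed by the program come from simple arithmetic, which the sketch below isolates. It assumes, as the code above does, that each element moves three floats (two loads and one store) and that peak bandwidth is frequency times bytes per cycle; the function names triad_rate_gbs and efficiency_percent are hypothetical and not part of the original program.

```c
#include <assert.h>

/* Bytes moved per triad element: loads of x[i] and y[i] plus the store to z[i]. */
#define TRIAD_BYTES_PER_ELEMENT (3.0 * sizeof(float))

/* Achieved bandwidth in GB/s for `repeat` passes over `n` floats in `seconds`. */
double triad_rate_gbs(int n, int repeat, double seconds)
{
    assert(seconds > 0.0);
    return 1e-9 * TRIAD_BYTES_PER_ELEMENT * n * repeat / seconds;
}

/* Achieved bandwidth as a percentage of the assumed peak,
   where peak (GB/s) = frequency (GHz) * bytes per cycle. */
double efficiency_percent(double rate_gbs, double frequency_ghz, int bytes_per_cycle)
{
    double peak_gbs = frequency_ghz * bytes_per_cycle;
    return 100.0 * rate_gbs / peak_gbs;
}
```

As an illustration with made-up timing: n = 2048 and repeat = 1000000 finishing in 0.1 s would be about 245.76 GB/s, or roughly 71% of the 345.6 GB/s peak assumed for the FMA build at 3.6 GHz.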


From the NASM manual:

A final caveat: ALIGN and ALIGNB work relative to the beginning of the section, not the beginning of the address space in the final executable. Aligning to a 16-byte boundary when the section you're in is only guaranteed to be aligned to a 4-byte boundary, for example, is a waste of effort. Again, NASM does not check that the section's alignment characteristics are sensible for the use of ALIGN or ALIGNB.

I'm not sure whether the code segment is getting an absolute 32-byte aligned address or only a relative one.
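
One way to check the absolute address at run time is to look at its low bits. This is a minimal sketch, assuming a power-of-two alignment; the helper name address_misalignment is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Distance in bytes of `addr` past the previous `align`-byte boundary.
   A result of 0 means the address is aligned. `align` must be a power
   of two (assumption of this sketch). */
uintptr_t address_misalignment(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & (align - 1);
}
```

From the C driver this could be called as address_misalignment((uintptr_t)triad_fma_asm_repeat, 32). Converting a function pointer to an integer is implementation-defined in C, although it is expected to work with GCC on x86-64. The inner-loop labels are local to the assembly file, so their addresses would have to be read from a disassembly instead.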

Recommended answer

Regarding your last question about relative (within-section) versus absolute (in memory at runtime) alignment: you don't have to worry too much. Just below the section of the manual you quoted, which warns that ALIGN does not check the section alignment, you have this:

Both ALIGN and ALIGNB call the SECTALIGN macro implicitly. See section 4.11.13 for details.

So basically ALIGN doesn't check that the alignment is sensible, but it does call the SECTALIGN macro so that the alignment will be sensible. In particular, all the implicit SECTALIGN calls should ensure that the section is aligned to the largest alignment specified by any align call.

The warning about ALIGN not checking then probably only applies to more obscure cases, e.g., when assembling into formats that don't support section alignment, when specifying an alignment larger than that supported by a section, or when SECTALIGN OFF has been called to disable SECTALIGN.
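
The reasoning can be sketched in C as a small model: if the section is aligned to the largest alignment requested inside it, every label that is aligned relative to the section start is also aligned in absolute terms. This models the rule only and is not NASM's implementation; the function names and the base addresses used with them are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Section alignment needed so that every ALIGN request inside the section
   also holds for absolute addresses: the largest requested alignment.
   Each request must be a power of two (assumption of this sketch). */
size_t required_section_alignment(const size_t *requests, size_t count)
{
    size_t max_align = 1;
    for (size_t i = 0; i < count; i++) {
        assert(requests[i] != 0 && (requests[i] & (requests[i] - 1)) == 0);
        if (requests[i] > max_align)
            max_align = requests[i];
    }
    return max_align;
}

/* Nonzero if a label at `offset` inside a section loaded at `base`
   lands on an `align`-byte boundary in memory. */
int label_is_aligned(uintptr_t base, size_t offset, size_t align)
{
    return ((base + offset) & (align - 1)) == 0;
}
```

With the align 16 and align 32 directives from the question, the required section alignment is 32. A label at section offset 0x40 is then 32-byte aligned for any base that is a multiple of 32, but not for a base such as 0x401004 that is only 4-byte aligned.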
