Looping over arrays with inline assembly

Problem description

When looping over an array with inline assembly, should I use the register modifier "r" or the memory modifier "m"?

Let's consider an example which adds two float arrays x and y and writes the result to z. Normally I would use intrinsics to do this, like this:

for(int i=0; i<n/4; i++) {
    __m128 x4 = _mm_load_ps(&x[4*i]);
    __m128 y4 = _mm_load_ps(&y[4*i]);
    __m128 s = _mm_add_ps(x4,y4);
    _mm_store_ps(&z[4*i], s);
}

Here is the inline assembly solution I have come up with using the register modifier "r"

void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%%rax,4), %%xmm0\n"
            "addps    (%2,%%rax,4), %%xmm0\n"
            "movaps   %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}

This generates similar assembly to GCC. The main difference is that GCC adds 16 to the index register and uses a scale of 1 whereas the inline-assembly solution adds 4 to the index register and uses a scale of 4.
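
To illustrate the difference (sketched by hand, not actual compiler output; the register choices are illustrative), the two loop bodies differ roughly like this:

    # intrinsics version as gcc compiles it: %rax holds a byte offset, scale 1
    movaps  (%rdi,%rax), %xmm0
    addps   (%rsi,%rax), %xmm0
    movaps  %xmm0, (%rdx,%rax)
    addq    $16, %rax               # step by 16 bytes (4 floats)

    # inline-asm version: %rax holds an element index, scale 4
    movaps  (%rdi,%rax,4), %xmm0
    addps   (%rsi,%rax,4), %xmm0
    movaps  %xmm0, (%rdx,%rax,4)
    addl    $4, %eax                # step by 4 elements (16 bytes)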

I was not able to use a general register for the iterator. I had to specify one, which in this case was rax. Is there a reason for this?

Here is the solution I came up with using the memory modifier "m"

void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   %1, %%xmm0\n"
            "addps    %2, %%xmm0\n"
            "movaps   %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
            );
    }
}

This is less efficient as it does not use an index register and instead has to add 16 to the base register of each array. The generated assembly is (gcc (Ubuntu 5.2.1-22ubuntu2) with gcc -O3 -S asmtest.c):

.L22:
    movaps   (%rsi), %xmm0
    addps    (%rdi), %xmm0
    movaps   %xmm0, (%rdx)
    addl    $4, %eax
    addq    $16, %rdx
    addq    $16, %rsi
    addq    $16, %rdi
    cmpl    %eax, %ecx
    ja      .L22

Is there a better solution using the memory modifier "m"? Is there some way to get it to use an index register? The reason I ask is that it seemed more logical to me to use the memory modifier "m" since I am reading and writing memory. Additionally, with the register modifier "r" I never use an output operand list, which seemed odd to me at first.

Maybe there is a better solution than using "r" or "m"?

Here is the full code I used to test this

#include <stdio.h>
#include <x86intrin.h>

#define N 64

void add_intrin(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __m128 x4 = _mm_load_ps(&x[i]);
        __m128 y4 = _mm_load_ps(&y[i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[i], s);
    }
}

void add_intrin2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i++) {
        __m128 x4 = _mm_load_ps(&x[4*i]);
        __m128 y4 = _mm_load_ps(&y[4*i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[4*i], s);
    }
}

void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%%rax,4), %%xmm0\n"
            "addps    (%2,%%rax,4), %%xmm0\n"
            "movaps   %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}

void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   %1, %%xmm0\n"
            "addps    %2, %%xmm0\n"
            "movaps   %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
            );
    }
}

int main(void) {
    float x[N], y[N], z1[N], z2[N], z3[N];
    for(int i=0; i<N; i++) x[i] = 1.0f, y[i] = 2.0f;
    add_intrin2(x,y,z1,N);
    add_asm1(x,y,z2,N);
    add_asm2(x,y,z3,N);
    for(int i=0; i<N; i++) printf("%.0f ", z1[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z2[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z3[i]); puts("");
}

Solution

You can use an r constraint for the index. Use the q modifier to get the name of the 64bit register, so you can use it in an addressing mode. When compiled for 32bit targets, the q modifier selects the name of the 32bit register, so the same code still works.

If you want to choose what kind of addressing mode is used, you'll need to do it yourself, using pointer operands with r constraints. There doesn't seem to be a way to tell the compiler that only a specific 32B of memory was clobbered, avoiding a full "memory" clobber.
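
For example, here is a sketch of that approach (untested; the function name is mine): each address is computed in C and passed as a pointer operand, so the asm just uses plain (%reg) addressing. The full "memory" clobber is still needed, because gcc only sees opaque pointers:

void add_asm_ptr(float *x, float *y, float *z, unsigned n) {
    for(unsigned i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[xp]), %%xmm0\n\t"
            "addps    (%[yp]), %%xmm0\n\t"
            "movaps   %%xmm0, (%[zp])\n\t"
            :
            : [xp] "r" (x+i), [yp] "r" (y+i), [zp] "r" (z+i)
            : "xmm0", "memory"
        );
    }
}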

The example at the end of the Clobbers section in the docs is for an input operand, not a clobber constraint. (I think I posted an answer a while ago that gave that syntax as a clobber without testing. Oops.)

The benefit to using an m constraint for the output, at least, would be to avoid needing a full "memory" clobber (forcing the compiler to reload all globals). However, gcc doesn't check whether the inline asm even references an operand. Using a dummy m output operand leads gcc to emit instructions to get an address into a register, because it chooses not to just use the input operands with an indexed addressing mode.
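
As a sketch of that dummy-operand pattern (untested, adapted from the style of the docs' example): the array-pointer casts describe exactly which 16 bytes each iteration reads and writes, so no "memory" clobber is needed; the dummy operands are never referenced in the template. The dummy "=m" output is what can tempt gcc into setting up an extra address register:

for(unsigned i=0; i<n; i+=4) {
    __asm__ __volatile__ (
        "movaps   (%[xp]), %%xmm0\n\t"
        "addps    (%[yp]), %%xmm0\n\t"
        "movaps   %%xmm0, (%[zp])\n\t"
        : "=m" (*(float (*)[4]) (z+i))          // dummy output: the 16B of z written
        : [xp] "r" (x+i), [yp] "r" (y+i), [zp] "r" (z+i),
          "m" (*(const float (*)[4]) (x+i)),    // dummy inputs: the 16B of x and y read
          "m" (*(const float (*)[4]) (y+i))
        : "xmm0"
    );
}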


Another huge benefit to an m constraint is that -funroll-loops can work by generating addresses with constant offsets. Doing the addressing ourselves prevents the compiler from doing a single increment every 4 iterations or something, because every source-level value of i needs to appear in a register.


Here's my version, with some tweaks as noted in comments.

#include <immintrin.h>
void add_asm1(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
            : "memory"
          // I thought it was possible to tell gcc *which* memory was clobbered, but that seems impossible
        );
    }
}

godbolt.

Your version needs to declare %xmm0 as clobbered, or you will have a bad time when this is inlined. My version uses a temporary variable as an output-only operand that's never used. This gives the compiler full freedom for register allocation.
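
For reference, a minimally fixed version of the original add_asm1 (untested; name mine) just adds the missing clobbers. The "memory" clobber is needed too, since nothing else tells gcc that z is written:

void add_asm1_fixed(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%%rax,4), %%xmm0\n"
            "addps    (%2,%%rax,4), %%xmm0\n"
            "movaps   %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            : "xmm0", "memory"
        );
    }
}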


A version with m constraints that gcc can unroll:

#include <immintrin.h>
void add_asm1(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
           // "movaps   %[yi], %[vectmp]\n\t"
            "addps    %[xi], %[vectmp]\n\t"  // We requested that the %[yi] input be in the same register as the [vectmp] dummy output
            "movaps   %[vectmp], %[zi]\n\t"
          // ugly ugly type-punning casts, but it does happen to work with gcc
            : [vectmp] "=x" (vectmp), [zi] "=m" (*(__m128*)&z[i])
            : [yi] "0"  (*(__m128*)&y[i])  // or [yi] "xm" (*(__m128*)&y[i]), and uncomment the movaps load
            , [xi] "xm" (*(__m128*)&x[i])
            :  // memory clobber not needed
        );
    }
}

Using [yi] as a "+x" input/output operand would be simpler, but writing it this way makes a smaller change when uncommenting the load in the inline asm, instead of letting the compiler get the value into a register for us.
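
One way to read that "+x" suggestion, as an untested sketch (my interpretation): let the compiler load y with an intrinsic, and make the temporary a read-write operand:

#include <immintrin.h>
void add_asm1_rw(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __m128 vectmp = _mm_load_ps(&y[i]);  // compiler emits the load of y for us
        __asm__ __volatile__ (
            "addps    %[xi], %[vectmp]\n\t"
            "movaps   %[vectmp], %[zi]\n\t"
            : [vectmp] "+x" (vectmp), [zi] "=m" (*(__m128*)&z[i])
            : [xi] "xm" (*(__m128*)&x[i])
            :  // no memory clobber needed
        );
    }
}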
