How to swap two __m128i variables in C++03 given it's an opaque type and an array?


Question

What is the best practice for swapping __m128i variables?

The background is a compile error under Sun Studio 12.2, which is a C++03 compiler. __m128i is an opaque type used with MMX and SSE instructions, and it's usually an unsigned long long[2]. C++03 does not provide support for swapping arrays, and std::swap(__m128i a, __m128i b) fails under the compiler.


Here are some related questions that don't quite hit the mark. They don't apply because std::vector is not available.

Solution

This doesn't sound like a best-practices issue; it sounds like you need a workaround for a seriously broken implementation of intrinsics. If __m128i tmp = a; doesn't compile, that's pretty bad.


If you're going to write a custom swap function, keep it simple. __m128i is a POD type that fits in a single vector register. Don't do anything that will encourage the compiler to spill it to memory. Some compilers will generate really horrible code even for a trivial test-case, and even gcc/clang might trip over a memcpy as part of optimizing a big complicated function.

Since the compiler is choking on the constructor (and apparently even on copy-initialization), just declare a tmp variable and use plain = assignment to do the copying. That always works efficiently in any compiler that supports __m128i, and is a common pattern.

Plain assignment to/from values in memory works like _mm_store_si128 / _mm_load_si128: i.e. movdqa aligned stores/loads that will fault if used on unaligned addresses. (Of course, optimization can result in loads getting folded into memory operands to another vector instruction, or stores not happening at all.)
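To make the equivalence concrete, here is a minimal sketch, assuming an SSE2 target and 16-byte-aligned pointers. The names copy_by_assignment, copy_by_intrinsics, same_bits and demo_copy are hypothetical helpers for illustration, not part of any library.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: __m128i, _mm_load_si128, _mm_store_si128

// Plain assignment through pointers. Both pointers are assumed to be
// 16-byte aligned; compilers typically emit an aligned load and store.
inline void copy_by_assignment(__m128i* dst, const __m128i* src) {
    *dst = *src;
}

// The same copy spelled out with the explicit aligned intrinsics.
inline void copy_by_intrinsics(__m128i* dst, const __m128i* src) {
    _mm_store_si128(dst, _mm_load_si128(src));
}

// Hypothetical helper: true when all 16 bytes of a and b match.
inline bool same_bits(__m128i a, __m128i b) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

// Small self-check: both spellings copy the same 16 bytes.
inline void demo_copy() {
    __m128i src = _mm_set_epi32(4, 3, 2, 1);
    __m128i d1, d2;
    copy_by_assignment(&d1, &src);
    copy_by_intrinsics(&d2, &src);
    assert(same_bits(d1, src) && same_bits(d2, src));
}
```

Local __m128i variables are aligned by the compiler, so the self-check never touches an unaligned address.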

// alternate names: assignment_swap
// or swap128, but then the name doesn't fit for __m256i...

// __m128i t(a) errors, so just use simple initializers / assignment
template<class T>
void vecswap(T& a, T& b) {
    // T t = a;     // Apparently SunCC even choked on this
    T t;
    t = a;
    a = b;
    b = t;
}
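A short usage sketch of the template above, assuming an SSE2 target. vecswap is repeated so the example is self-contained; low_lane and demo_vecswap are hypothetical helpers added only to make the result checkable.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: __m128i, _mm_set1_epi32, _mm_cvtsi128_si32

// Same assignment-only swap as above.
template <class T>
void vecswap(T& a, T& b) {
    T t;
    t = a;
    a = b;
    b = t;
}

// Hypothetical helper: read the low 32-bit lane of a vector.
inline int low_lane(__m128i v) {
    return _mm_cvtsi128_si32(v);
}

// Swap two vectors and check that their contents traded places.
inline void demo_vecswap() {
    __m128i a = _mm_set1_epi32(1);
    __m128i b = _mm_set1_epi32(2);
    vecswap(a, b);
    assert(low_lane(a) == 2);
    assert(low_lane(b) == 1);
}
```

Because it is a plain template, the same function also works for __m128, __m128d, or ordinary scalar types.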

Test cases: vecswap compiles to optimal code even with a crusty compiler like ICC13, which does a terrible job with the memcpy version. The asm output is from the Godbolt compiler explorer, with icc13 -O3.

__m128i test_return2nd(__m128i x, __m128i y) {
    vecswap(x, y);
    return x;
}

    movdqa    xmm0, xmm1
    ret                    # returning the 2nd arg, which was in xmm1


__m128i test_return1st(__m128i x, __m128i y) {
    vecswap(x, y);
    return y;
}

    ret                   # returning the first arg, already in xmm0

With memswap (the same swap done with memcpy through a temporary), you get something like

return1st_memcpy(__m128i, __m128i):        ## ICC13 -O3
    movdqa    XMMWORD PTR [-56+rsp], xmm0
    movdqa    XMMWORD PTR [-40+rsp], xmm1    # spill both
    movaps    xmm2, XMMWORD PTR [-56+rsp]    # reload x
    movaps    XMMWORD PTR [-24+rsp], xmm2    # copy x to tmp
    movaps    xmm0, XMMWORD PTR [-40+rsp]    # reload y
    movaps    XMMWORD PTR [-56+rsp], xmm0    # copy y to x
    movaps    xmm0, XMMWORD PTR [-24+rsp]    # reload tmp
    movaps    XMMWORD PTR [-40+rsp], xmm0    # copy tmp to y
    movdqa    xmm0, XMMWORD PTR [-40+rsp]    # reload y
    ret                                      # return y

This is pretty much the absolute maximum amount of spilling/reloading you could imagine to swap two registers, because icc13 doesn't optimize between the inlined memcpys at all, or even remember what is left in a register.
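The source of memswap is not shown here. The following is a hypothetical reconstruction, assuming it is the usual three-memcpy swap through a temporary, which matches the three spill/reload pairs in the listing above. The demo_memswap helper is likewise only for illustration.

```cpp
#include <cassert>
#include <cstring>      // std::memcpy
#include <emmintrin.h>  // SSE2: __m128i

// Hypothetical memcpy-based swap (an assumption, not the original source).
// Taking the addresses of a and b is what tempts weaker optimizers into
// spilling both vectors to the stack.
template <class T>
void memswap(T& a, T& b) {
    T t;
    std::memcpy(&t, &a, sizeof(T));
    std::memcpy(&a, &b, sizeof(T));
    std::memcpy(&b, &t, sizeof(T));
}

// The result is the same as with vecswap; only the generated code differs.
inline void demo_memswap() {
    __m128i a = _mm_set1_epi32(1);
    __m128i b = _mm_set1_epi32(2);
    memswap(a, b);
    assert(_mm_cvtsi128_si32(a) == 2);
    assert(_mm_cvtsi128_si32(b) == 1);
}
```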


Swapping values already in memory

Even gcc makes worse code with the memcpy version. It does the copy with 64-bit integer loads/stores instead of a 128-bit vector load/store. This is terrible if you're about to load the vector (store-forwarding stall), and otherwise is just bad (more uops to do the same work).

// the memcpy version of this compiles badly
void test_mem(__m128i *x, __m128i *y) {
    vecswap(*x, *y);
}
    # gcc 5.3 and ICC13 make the same code here, since it's easy to optimize
    movdqa  xmm0, XMMWORD PTR [rdi]
    movdqa  xmm1, XMMWORD PTR [rsi]
    movaps  XMMWORD PTR [rdi], xmm1
    movaps  XMMWORD PTR [rsi], xmm0
    ret

// gcc 5.3 with memswap instead of vecswap.  ICC13 is similar
test_mem_memcpy(long long __vector(2)*, long long __vector(2)*):
    mov     rax, QWORD PTR [rdi]
    mov     rdx, QWORD PTR [rdi+8]
    mov     r9, QWORD PTR [rsi]
    mov     r10, QWORD PTR [rsi+8]
    mov     QWORD PTR [rdi], r9
    mov     QWORD PTR [rdi+8], r10
    mov     QWORD PTR [rsi], rax
    mov     QWORD PTR [rsi+8], rdx
    ret
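If the pointers might not be 16-byte aligned, plain assignment through __m128i* can fault, as noted earlier. One possible way to keep 128-bit vector loads/stores in that case is to use the unaligned intrinsics explicitly. This is a sketch under that assumption; swap128_unaligned and demo_swap_unaligned are hypothetical names.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_storeu_si128

// Hypothetical helper: swaps the 16 bytes at p and q with no alignment
// requirement. Both loads happen before either store.
inline void swap128_unaligned(void* p, void* q) {
    __m128i a = _mm_loadu_si128(static_cast<const __m128i*>(p));
    __m128i b = _mm_loadu_si128(static_cast<const __m128i*>(q));
    _mm_storeu_si128(static_cast<__m128i*>(p), b);
    _mm_storeu_si128(static_cast<__m128i*>(q), a);
}

// Swap 16 bytes starting at a deliberately misaligned offset.
inline void demo_swap_unaligned() {
    unsigned char x[17], y[17];
    for (int i = 0; i < 17; ++i) {
        x[i] = static_cast<unsigned char>(i);
        y[i] = static_cast<unsigned char>(100 + i);
    }
    swap128_unaligned(x + 1, y + 1);
    assert(x[0] == 0 && y[0] == 100);     // bytes outside the range are untouched
    assert(x[1] == 101 && x[16] == 116);  // x now holds y's old bytes
    assert(y[1] == 1 && y[16] == 16);     // y now holds x's old bytes
}
```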
