How can I add together two SSE registers

Question

I have two SSE registers (one register is 128 bits) and I want to add them together. I know how to add the corresponding words in them; for example, I can do it with _mm_add_epi16 if I use 16-bit words in the registers. What I want is something like _mm_add_epi128 (which does not exist), which would treat the whole register as one big word. Is there any way to perform this operation, even if multiple instructions are needed?

I was thinking about using _mm_add_epi64, detecting overflow in the right (low) word, and then adding 1 to the left (high) word of the register if needed. However, I would also like the approach to work for 256-bit registers (AVX2), and it seems too complicated for that.

Recommended answer

To add two 128-bit numbers x and y to give z with SSE, you can do it like this:

z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);

This is based on a linked post. The function unsigned_lessthan is defined below. It is complicated without AMD XOP (I actually found a simpler version for SSE4.2 if XOP is not available; see the end of my answer). Some of the other people here can probably suggest a better method. Here is some code showing that this works.

#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>

// Returns an all-ones mask in each 64-bit lane where a < b (unsigned).
static inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
#ifdef __XOP__  // AMD XOP instruction set
    return _mm_comgt_epu64(b,a);                           // b > a, i.e. a < b
#else  // SSE2 instruction set
    __m128i sign32  = _mm_set1_epi32(0x80000000);          // sign bit of each dword
    __m128i bflip   = _mm_xor_si128(b,sign32);             // b with sign bits flipped
    __m128i aflip   = _mm_xor_si128(a,sign32);             // a with sign bits flipped
    __m128i equal   = _mm_cmpeq_epi32(b,a);                // a == b, dwords
    __m128i bigger  = _mm_cmpgt_epi32(bflip,aflip);        // b > a, dwords
    __m128i biggerl = _mm_shuffle_epi32(bigger,0xA0);      // b > a, low dwords copied to high dwords
    __m128i eqbig   = _mm_and_si128(equal,biggerl);        // high part equal and low part bigger
    __m128i hibig   = _mm_or_si128(bigger,eqbig);          // high part bigger, or high part equal and low part bigger
    __m128i big     = _mm_shuffle_epi32(hibig,0xF5);       // result copied to low part
    return big;
#endif
}

int main() {
    __m128i x,y,z,c;
    x = _mm_set_epi64x(3,0xffffffffffffffffll);
    y = _mm_set_epi64x(1,0x2ll);
    z = _mm_add_epi64(x,y);
    c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
    z = _mm_sub_epi64(z,c);

    int out[4];
    //int64_t out[2];
    _mm_storeu_si128((__m128i*)out, z);
    printf("%d %d\n", out[2], out[0]);
}

The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2, which does not exist yet. And even if you have XOP, it may only be efficient when adding two 128-bit or 256-bit numbers in parallel (you could do four with AVX if XOP2 existed), so as to avoid horizontal instructions such as _mm_unpacklo_epi64.

The best solution in general is to push the registers onto the stack and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4, you can add them like this:

__m256i x4, y4, z4;

uint64_t x[4], y[4], z[4];
_mm256_storeu_si256((__m256i*)x, x4);
_mm256_storeu_si256((__m256i*)y, y4);
add_u256(x,y,z);
z4 = _mm256_loadu_si256((__m256i*)z);

void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
    uint64_t c1 = 0, c2 = 0, tmp;
    //add low 128-bits
    z[0] = x[0] + y[0];
    z[1] = x[1] + y[1];
    c1 += z[1]<x[1];
    tmp = z[1];
    z[1] += z[0]<x[0];
    c1 += z[1]<tmp;
    //add high 128-bits + carry from low 128-bits
    z[2] = x[2] + y[2];
    c2 += z[2]<x[2];
    tmp = z[2];
    z[2] += c1;
    c2 += z[2]<tmp; 
    z[3] = x[3] + y[3] + c2;
}

int main() {
    uint64_t x[4], y[4], z[4];
    x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    y[0] = 1; y[1] = 1; y[2] = 1; y[3] = 1;
    //z = x + y  (z3,z2,z1,z0) = (2,3,1,0)
    //x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    //y[0] = 1; y[1] = 0; y[2] = 1; y[3] = 1;
    //z = x + y  (z3,z2,z1,z0) = (2,3,0,0)
    add_u256(x,y,z);
    for(int i=3; i>=0; i--) printf("%llu ", (unsigned long long)z[i]);
    printf("\n");
}
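
The same carry-propagation idea as add_u256 can be written as a loop over any number of 64-bit limbs. The following is only a sketch: the name add_limbs and its signature are hypothetical (not part of the answer above), and it assumes the limbs are stored least-significant first, as in the x[0]..x[3] layout used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the original answer.
   Computes z = x + y over n little-endian 64-bit limbs and
   returns the carry out of the most significant limb (0 or 1). */
static uint64_t add_limbs(const uint64_t *x, const uint64_t *y,
                          uint64_t *z, size_t n) {
    assert(n > 0);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = x[i] + y[i];   /* limb sum, may wrap around */
        uint64_t c = s < x[i];      /* 1 if x[i] + y[i] wrapped */
        uint64_t t = s + carry;     /* add the incoming carry */
        c += t < s;                 /* 1 if adding the carry wrapped */
        z[i] = t;
        carry = c;                  /* at most one of the two wraps can occur */
    }
    return carry;
}
```

With n = 4 this should give the same result as add_u256, and the return value tells you whether the full sum overflowed the available limbs.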

Based on a comment by Stephen Canon at saturated-substraction-avx-or-sse4-2, I discovered there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 if XOP is not available.

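__m128i a,b;                                              // inputs, two unsigned 64-bit lanes each
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000ULL);  // sign bit of each qword
__m128i aflip = _mm_xor_si128(a, sign64);                 // a with sign bits flipped
__m128i bflip = _mm_xor_si128(b, sign64);                 // b with sign bits flipped
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);               // a > b (unsigned), per 64-bit lane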
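
For completeness, here is a sketch of how this SSE4.2 comparison could be plugged into the 128-bit addition from the start of the answer. The function names (unsigned_lessthan_sse42, add_u128_sse42, add_u128) are hypothetical, and the target attribute is an assumption that you are using GCC or Clang; with other compilers you would drop the attribute and enable SSE4.2 with the appropriate compiler flag instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* All-ones mask in each 64-bit lane where a < b (unsigned).
   Hypothetical name; uses the sign-flip trick shown above. */
__attribute__((target("sse4.2")))
static __m128i unsigned_lessthan_sse42(__m128i a, __m128i b) {
    const __m128i sign64 = _mm_set1_epi64x(INT64_MIN);  /* sign bit of each qword */
    __m128i aflip = _mm_xor_si128(a, sign64);
    __m128i bflip = _mm_xor_si128(b, sign64);
    return _mm_cmpgt_epi64(bflip, aflip);               /* b > a, i.e. a < b */
}

/* 128-bit add: lane-wise add, then propagate the low-lane carry. */
__attribute__((target("sse4.2")))
static __m128i add_u128_sse42(__m128i x, __m128i y) {
    __m128i z  = _mm_add_epi64(x, y);
    __m128i lt = unsigned_lessthan_sse42(z, x);          /* low lane wrapped? */
    /* Move the low-lane mask into the high lane, zero in the low lane. */
    __m128i c  = _mm_unpacklo_epi64(_mm_setzero_si128(), lt);
    return _mm_sub_epi64(z, c);                          /* subtracting -1 adds 1 */
}

/* Convenience wrapper on little-endian limb arrays: z = x + y (mod 2^128). */
__attribute__((target("sse4.2")))
static void add_u128(const uint64_t x[2], const uint64_t y[2], uint64_t z[2]) {
    assert(x != NULL && y != NULL && z != NULL);
    __m128i xv = _mm_loadu_si128((const __m128i *)x);
    __m128i yv = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)z, add_u128_sse42(xv, yv));
}
```

Calling add_u128 with x = {0xffffffffffffffff, 3} and y = {2, 1} (low limb first) corresponds to the test values used in the first main() above.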
