SIMD 对 64 位 * 64 位到 128 位的无符号乘法进行签名 [英] SIMD signed with unsigned multiplication for 64-bit * 64-bit to 128-bit

查看:43
本文介绍了SIMD 对 64 位 * 64 位到 128 位的无符号乘法进行签名的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我创建了一个使用 SIMD 执行 64 位 * 64 位到 128 位的函数.目前我已经使用 SSE2(实际上是 SSE4.1)实现了它.这意味着它同时做两个 64b*64b 到 128b 的产品.同样的想法可以扩展到 AVX2 或 AVX512,同时提供四个或八个 64b*64 到 128b 产品.我的算法基于 http://www.hackersdelight.org/hdcodetxt/muldws.c.txt

I have created a function which does 64-bit * 64-bit to 128-bit using SIMD. Currently I have implemented it using SSE2 (acutally SSE4.1). This means it does two 64b*64b to 128b products at the same time. The same idea could be extended to AVX2 or AVX512 giving four or eight 64b*64 to 128b products at the same time. I based my algorithm on http://www.hackersdelight.org/hdcodetxt/muldws.c.txt

该算法执行一次无符号乘法、一次有符号乘法和两次有符号 * 无符号乘法.使用 _mm_mul_epi32_mm_mul_epu32 可以轻松执行有符号 * 有符号和无符号 * 无符号操作.但是混合签名和未签名的产品给我带来了麻烦.例如考虑一下.

That algorithm does one unsigned multiplication, one signed multiplication, and two signed * unsigned multiplications. The signed * signed and unsigned * unsigned operations are easy to do using _mm_mul_epi32 and _mm_mul_epu32. But the mixed signed and unsigned products caused me trouble. Consider for example.

int32_t x = 0x80000000;
uint32_t y = 0x7fffffff;
int64_t z = (int64_t)x*y;

双字积应该是0xc000000080000000.但是如果你假设你的编译器确实知道如何处理混合类型,你怎么能得到这个呢?这是我想出的:

The double word product should be 0xc000000080000000. But how can you get this if you assume your compiler does know how to handle mixed types? This is what I came up with:

int64_t sign = x<0; sign*=-1;        //get the sign and make it all ones
uint32_t t = abs(x);                 //if x<0 take two's complement again
uint64_t prod = (uint64_t)t*y;       //unsigned product
int64_t z = (prod ^ sign) - sign;    //take two's complement based on the sign

使用 SSE 可以这样做

Using SSE this can be done like this

__m128i xh;    //(xl2, xh2, xl1, xh1) high is signed, low unsigned
__m128i yl;    //(yh2, yl2, yh2, yl2)
__m128i xs     = _mm_cmpgt_epi32(_mm_setzero_si128(), xh); // get sign
        xs     = _mm_shuffle_epi32(xs, 0xA0);              // extend sign
__m128i t      = _mm_sign_epi32(xh,xh);                    // abs(xh)
__m128i prod   = _mm_mul_epu32(t, yl);                     // unsigned (xh2*yl2,xh1*yl1)
__m128i inv    = _mm_xor_si128(prod,xs);                   // invert bits if negative
__m128i z      = _mm_sub_epi64(inv,xs);                    // add 1 if negative

这给出了正确的结果.但是我必须这样做两次(一次平方时),现在它是我功能的重要组成部分.使用 SSE4.2、AVX2(四个 128 位产品)甚至 AVX512(八个 128 位产品)是否有更有效的方法?

This gives the correct result. But I have to do this twice (once when squaring) and it's now a significant fraction of my function. Is there a more efficient way of doing this with SSE4.2, AVX2 (four 128bit products), or even AVX512 (eight 128bit products)?

也许有比 SIMD 更有效的方法?要得到大字,需要大量的计算.

Maybe there are more efficient ways of doing this than with SIMD? It's a lot of calculations to get the upper word.

根据@ElderBug 的评论,看起来这样做的方法不是使用 SIMD,而是使用 mul 指令.对于它的价值,如果有人想看看这有多复杂,这里是完整的工作功能(我刚刚开始工作,所以我没有优化它,但我认为它不值得).

based on the comment by @ElderBug it looks like the way to do this is not with SIMD but with the mul instruction. For what it's worth, in case anyone want's to see how complicated this is, here is the full working function (I just got it working so I have not optimized it but I don't think it's worth it).

void muldws1_sse(__m128i x, __m128i y, __m128i *lo, __m128i *hi) {
    __m128i lomask = _mm_set1_epi64x(0xffffffff);

    __m128i xh     = _mm_shuffle_epi32(x, 0xB1);    // x0l, x0h, x1l, x1h
    __m128i yh     = _mm_shuffle_epi32(y, 0xB1);    // y0l, y0h, y1l, y1h

    __m128i xs     = _mm_cmpgt_epi32(_mm_setzero_si128(), xh);
    __m128i ys     = _mm_cmpgt_epi32(_mm_setzero_si128(), yh);
            xs     = _mm_shuffle_epi32(xs, 0xA0);
            ys     = _mm_shuffle_epi32(ys, 0xA0);

    __m128i w0     = _mm_mul_epu32(x,  y);          // x0l*y0l, y0l*y0h
    __m128i w3     = _mm_mul_epi32(xh, yh);         // x0h*y0h, x1h*y1h
            xh     = _mm_sign_epi32(xh,xh);
            yh     = _mm_sign_epi32(yh,yh);

    __m128i w1     = _mm_mul_epu32(x,  yh);         // x0l*y0h, x1l*y1h
    __m128i w2     = _mm_mul_epu32(xh, y);          // x0h*y0l, x1h*y0l

    __m128i yinv   = _mm_xor_si128(w1,ys);          // invert bits if negative
            w1     = _mm_sub_epi64(yinv,ys);         // add 1
    __m128i xinv   = _mm_xor_si128(w2,xs);          // invert bits if negative
            w2     = _mm_sub_epi64(xinv,xs);         // add 1

    __m128i w0l    = _mm_and_si128(w0, lomask);
    __m128i w0h    = _mm_srli_epi64(w0, 32);

    __m128i s1     = _mm_add_epi64(w1, w0h);         // xl*yh + w0h;
    __m128i s1l    = _mm_and_si128(s1, lomask);      // lo(wl*yh + w0h);
    __m128i s1h    = _mm_srai_epi64(s1, 32);

    __m128i s2     = _mm_add_epi64(w2, s1l);         //xh*yl + s1l
    __m128i s2l    = _mm_slli_epi64(s2, 32);
    __m128i s2h    = _mm_srai_epi64(s2, 32);           //arithmetic shift right

    __m128i hi1    = _mm_add_epi64(w3, s1h);
            hi1    = _mm_add_epi64(hi1, s2h);

    __m128i lo1    = _mm_add_epi64(w0l, s2l);
    *hi = hi1;
    *lo = lo1;
}

情况变得更糟.在 AVX512 之前没有 _mm_srai_epi64 内在/指令,所以我必须自己制作.

It gets worse. There is no _mm_srai_epi64 instrinsic/instruction until AVX512 so I had to make my own.

static inline __m128i _mm_srai_epi64(__m128i a, int b) {
    __m128i sra = _mm_srai_epi32(a,32);
    __m128i srl = _mm_srli_epi64(a,32);
    __m128i mask = _mm_set_epi32(-1,0,-1,0);
    __m128i out = _mm_blendv_epi8(srl, sra, mask);
}

<小时>

我上面的 _mm_srai_epi64 的实现是不完整的.我想我使用的是 Agner Fog 的 Vector Class Library.如果你查看文件 vectori128.h 你会发现


My implementation of _mm_srai_epi64 above is incomplete. I think I was using Agner Fog's Vector Class Library. If you look in the file vectori128.h you find

static inline Vec2q operator >> (Vec2q const & a, int32_t b) {
    // instruction does not exist. Split into 32-bit shifts
    if (b <= 32) {
        __m128i bb   = _mm_cvtsi32_si128(b);               // b
        __m128i sra  = _mm_sra_epi32(a,bb);                // a >> b signed dwords
        __m128i srl  = _mm_srl_epi64(a,bb);                // a >> b unsigned qwords
        __m128i mask = _mm_setr_epi32(0,-1,0,-1);          // mask for signed high part
        return  selectb(mask,sra,srl);
    }
    else {  // b > 32
        __m128i bm32 = _mm_cvtsi32_si128(b-32);            // b - 32
        __m128i sign = _mm_srai_epi32(a,31);               // sign of a
        __m128i sra2 = _mm_sra_epi32(a,bm32);              // a >> (b-32) signed dwords
        __m128i sra3 = _mm_srli_epi64(sra2,32);            // a >> (b-32) >> 32 (second shift unsigned qword)
        __m128i mask = _mm_setr_epi32(0,-1,0,-1);          // mask for high part containing only sign
        return  selectb(mask,sign,sra3);
    }
}

推荐答案

考虑使用各种指令的整数乘法的吞吐量限制的正确方法是每个周期可以计算多少乘积位".

The right way to think about the throughput limits of integer multiplication using various instructions is in terms of how many "product bits" you can compute per cycle.

mulx 每个周期产生一个 64x64 -> 128 结果;那是 64x64 = 4096 每周期的产品位数"

mulx produces one 64x64 -> 128 result every cycle; that's 64x64 = 4096 "product bits per cycle"

如果您从执行 32x32 -> 64 位乘法的指令中拼凑出 SIMD 上的乘法器,则您需要能够在每个周期获得四个结果以匹配 mulx (4x32x32 = 4096).如果除了乘法之外没有算术,那么您将在 AVX2 上收支平衡.不幸的是,正如您所注意到的,除了乘法之外,还有很多算术,所以这在当前一代硬件上是完全不可能的.

If you piece together a multiplier on SIMD out of instructions that do 32x32 -> 64 bit multiplies, you need to be able to get four results every cycle to match mulx (4x32x32 = 4096). If there was no arithmetic other than the multiplies, you'd just break even on AVX2. Unfortunately, as you've noticed, there's lots of arithmetic other than the multiplies, so this is a total non-starter on current generation hardware.

这篇关于SIMD 对 64 位 * 64 位到 128 位的无符号乘法进行签名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆