SSE2 shift by vector

Question

I've been trying to implement a shift by vector with SSE2 intrinsics, but from experimentation and the Intel Intrinsics Guide, the shift appears to use only the least-significant part of the count vector.

To reword my question, given a vector {v1, v2, ..., vn} and a set of shifts {s1, s2, ..., sn}, how do I calculate a result {r1, r2, ..., rn} such that:

r1 = v1 << s1
r2 = v2 << s2
...
rn = vn << sn

since it appears that _mm_sll_epi* performs this:

r1 = v1 << s1
r2 = v2 << s1
...
rn = vn << s1
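The behaviour described above can be reproduced in isolation with a small sketch. The helper name `sll_epi64_demo` is hypothetical and not part of the original code; it only assumes an x86 target with SSE2.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

// Hypothetical demo helper: shift the vector {v0, v1} left by the count
// vector {s0, s1} using _mm_sll_epi64 and return both 64-bit results.
static std::array<std::uint64_t, 2> sll_epi64_demo(std::uint64_t v0, std::uint64_t v1,
                                                   std::uint64_t s0, std::uint64_t s1)
{
    // _mm_set_epi64x takes the high element first, then the low element.
    __m128i v = _mm_set_epi64x(static_cast<long long>(v1), static_cast<long long>(v0));
    __m128i s = _mm_set_epi64x(static_cast<long long>(s1), static_cast<long long>(s0));

    // Only the low 64 bits of s are used as the count, for both elements.
    __m128i r = _mm_sll_epi64(v, s);

    std::array<std::uint64_t, 2> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), r);  // unaligned store is safe here
    return out;
}
```

Calling `sll_epi64_demo(7, 8, 4, 5)` shifts both elements by 4, so the second element comes out as 8 << 4 rather than 8 << 5.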

Thanks in advance.

Here is my code:

#include <iostream>

#include <cstdint>

#include <mmintrin.h>
#include <emmintrin.h>

namespace SIMD {

    using namespace std;

    class SSE2 {
    public:
        // flipped operands due to function arguments
        SSE2(uint64_t a, uint64_t b, uint64_t c, uint64_t d) { low = _mm_set_epi64x(b, a); high = _mm_set_epi64x(d, c); }

        uint64_t& operator[](int idx)
        {
            switch (idx) {
            case 0:
                _mm_storel_epi64((__m128i*)result, low);
                return result[0];
            case 1:
                _mm_store_si128((__m128i*)result, low);
                return result[1];
            case 2:
                _mm_storel_epi64((__m128i*)result, high);
                return result[0];
            case 3:
                _mm_store_si128((__m128i*)result, high);
                return result[1];
            }

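            /* Out-of-range index is a caller error. Return a valid reference,
               because a uint64_t& cannot bind to the literal 0. */
            return result[0];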
        }

        SSE2& operator<<=(const SSE2& rhs)
        {
            low  = _mm_sll_epi64(low,  rhs.getlow());
            high = _mm_sll_epi64(high, rhs.gethigh());

            return *this;
        }

        void print()
        {
            uint64_t a[2];
            _mm_storeu_si128((__m128i*)a, low);   // unaligned store: a[] is not guaranteed to be 16-byte aligned

            cout << hex;
            cout << a[0] << ' ' << a[1] << ' ';

            _mm_storeu_si128((__m128i*)a, high);

            cout << a[0] << ' ' << a[1] << ' ';
            cout << dec;
        }

        __m128i getlow() const
        {
            return low;
        }

        __m128i gethigh() const
        {
            return high;
        }
    private:
        __m128i low, high;
        uint64_t result[2];
    };
}

int main()
{
    using namespace std;   // the using-directive inside namespace SIMD does not apply in main()

    cout << "operator<<= test: vector << vector: ";
    {
        auto x = SIMD::SSE2(7, 8, 15, 10);
        auto y = SIMD::SSE2(4, 5,  6,  7);

        x.print();
        y.print();

        x <<= y;

        if (x[0] != 112 || x[1] != 256 || x[2] != 960 || x[3] != 1280) {
            cout << "FAILED: ";
            x.print();
            cout << endl;
        } else {
            cout << "PASSED" << endl;
        }
    }

    return 0;
}

The expected results are {7 << 4 = 112, 8 << 5 = 256, 15 << 6 = 960, 10 << 7 = 1280}. The actual results seem to be {7 << 4 = 112, 8 << 4 = 128, 15 << 6 = 960, 10 << 6 = 640}, which isn't what I want.

Hope this helps, Jens.

Recommended answer

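If AVX2 is available, and your elements are 32 or 64 bits, your operation takes one variable-shift instruction: vpsrlvq (__m128i _mm_srlv_epi64 (__m128i a, __m128i count)) for right shifts, or its left-shift counterpart vpsllvq (_mm_sllv_epi64) for the shifts in the question.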

For 32-bit elements with SSE4.1, see Shifting 4 integers right by different values SIMD. Depending on latency vs. throughput requirements, you can do separate shifts and then blend, or use a multiply (by a specially-constructed vector of powers of 2) to get variable-count left shifts and then do a same-count-for-all-elements right shift.
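As a rough illustration of the multiply idea, here is a minimal sketch of a per-element 32-bit left shift that stays within SSE2. The function name `sllv_epi32_sse2_sketch` is hypothetical, and the sketch assumes every count is in the range 0 to 30. Because SSE2 has no 32-bit low multiply (pmulld is SSE4.1), it multiplies the even and odd elements separately with pmuludq, which is a substitution on my part rather than what the linked answer necessarily does.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: r[i] = v[i] << count[i] for four 32-bit elements.
// Assumption: every count[i] is in [0, 30].
static inline __m128i sllv_epi32_sse2_sketch(__m128i v, __m128i count)
{
    // Build 2^count as a float by adding count to the exponent field of 1.0f,
    // then convert that float back to an integer power of two.
    __m128i exponent = _mm_slli_epi32(count, 23);
    __m128i as_float = _mm_add_epi32(exponent, _mm_set1_epi32(0x3f800000));
    __m128i pow2     = _mm_cvttps_epi32(_mm_castsi128_ps(as_float));

    // pmuludq multiplies elements 0 and 2 and produces two 64-bit products.
    __m128i even = _mm_mul_epu32(v, pow2);
    // Move elements 1 and 3 down into the even positions and multiply them too.
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(v, 32), _mm_srli_epi64(pow2, 32));

    // Keep the low 32 bits of each product and interleave them back in order.
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));  // {p0, p2, ...}
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));  // {p1, p3, ...}
    return _mm_unpacklo_epi32(even, odd);                     // {p0, p1, p2, p3}
}
```

The low 32 bits of v * 2^count equal v << count, which is why a multiply can stand in for a variable left shift. Whether this beats separate shifts plus a blend depends on the target CPU.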

With 64-bit elements there are only two elements per SSE vector, so we just need two shifts and then combine the results. We can do that with a pblendw, with a floating-point movsd (which may cause extra bypass-delay latency on some CPUs), with two shuffles, or with two ANDs and an OR.

#include <emmintrin.h>  // SSE2 intrinsics

__m128i SSE2_emulated_srlv_epi64(__m128i a, __m128i count)
{
    __m128i shift_low = _mm_srl_epi64(a, count);          // both elements shifted by count[0]: high 64 is garbage
    __m128i count_high = _mm_unpackhi_epi64(count,count); // broadcast the high element
    __m128i shift_high = _mm_srl_epi64(a, count_high);    // both elements shifted by count[1]: low 64 is garbage
    // SSE4.1: take the low four words from shift_low and the high four words from shift_high
    // return _mm_blend_epi16(shift_low, shift_high, 0xF0);

#if 1   // use movsd to blend
    __m128d blended = _mm_move_sd( _mm_castsi128_pd(shift_high), _mm_castsi128_pd(shift_low) );  // use movsd as a blend.  Faster than multiple instructions on most CPUs, but probably bad on Nehalem.
    return _mm_castpd_si128(blended);
#else  // SSE2 without using FP instructions:
    // if we're going to do it this way, we could have shuffled the input before shifting.  Probably not helpful though.
    shift_high = _mm_unpackhi_epi64(shift_high, shift_high);       // broadcast the high64
    return       _mm_unpacklo_epi64(shift_low, shift_high);        // combine: low 64 from shift_low, high 64 from shift_high
#endif
}
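The function above shifts right, while the question asks for left shifts. A minimal left-shift sketch along the same lines, combining the halves with the two-ANDs-and-an-OR option mentioned earlier, might look like this. The name `sllv_epi64_sse2_sketch` is hypothetical, and this is only one of the possible ways to merge the two results.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: r[i] = a[i] << count[i] for two 64-bit elements.
static inline __m128i sllv_epi64_sse2_sketch(__m128i a, __m128i count)
{
    // Both elements shifted by count[0]: only the low result is wanted.
    __m128i shift_low  = _mm_sll_epi64(a, count);
    // Broadcast count[1] so that psllq sees it in the low 64 bits.
    __m128i count_high = _mm_unpackhi_epi64(count, count);
    // Both elements shifted by count[1]: only the high result is wanted.
    __m128i shift_high = _mm_sll_epi64(a, count_high);

    // Mask with all ones in the low 64 bits and zeros in the high 64 bits.
    const __m128i low_mask = _mm_set_epi64x(0, -1);
    __m128i low_part  = _mm_and_si128(low_mask, shift_low);      // keep the low element
    __m128i high_part = _mm_andnot_si128(low_mask, shift_high);  // keep the high element
    return _mm_or_si128(low_part, high_part);
}
```

With the values from the question, shifting {7, 8} by {4, 5} yields {112, 256}, and shifting {15, 10} by {6, 7} yields {960, 1280}.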

Other shuffles like pshufd or psrldq would work, but punpckhqdq gets the job done without needing an immediate byte, so it's one byte shorter. SSSE3 palignr could get the high element from one register and the low element from another register into one vector, but they'd be reversed (so we'd need a pshufd to swap high and low halves). shufpd would work to blend, but has no advantage over movsd.
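For illustration, two of the alternatives named above can be sketched as small helpers. Both names are hypothetical, and neither helper is claimed to be faster than the punpckhqdq and movsd versions.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: broadcast the high 64-bit element with pshufd
// instead of punpckhqdq. Needs an immediate byte, so the encoding is longer.
static inline __m128i broadcast_high64_pshufd(__m128i v)
{
    // Select 32-bit elements {2, 3, 2, 3}, i.e. the high half twice.
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 3, 2));
}

// Hypothetical helper: take the low 64 bits from lo and the high 64 bits
// from hi using shufpd as a blend.
static inline __m128i blend_lo_hi_shufpd(__m128i lo, __m128i hi)
{
    // Immediate bit 0 = 0 picks the low element of the first operand,
    // immediate bit 1 = 1 picks the high element of the second operand.
    __m128d r = _mm_shuffle_pd(_mm_castsi128_pd(lo), _mm_castsi128_pd(hi), 2);
    return _mm_castpd_si128(r);
}
```

`blend_lo_hi_shufpd(shift_low, shift_high)` would produce the same vector as the movsd blend in the function above.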

See Agner Fog's microarch guide for the details of the potential bypass-delay latency from using an FP instruction between two integer instructions. It's probably fine on Intel SnB-family CPUs, because other FP shuffles are. (And yes, movsd xmm1, xmm0 runs on the shuffle unit in port5. Use movaps or movapd for reg-reg moves even of scalars if you don't need the merging behaviour).

This compiles (on Godbolt with gcc5.3 -O3) to

    movdqa  xmm2, xmm0  # tmp97, a
    psrlq   xmm2, xmm1    # tmp97, count
    punpckhqdq      xmm1, xmm1  # tmp99, count
    psrlq   xmm0, xmm1    # tmp100, tmp99
    movsd   xmm0, xmm2    # tmp102, tmp97
    ret
