Optimal SSE unsigned 8 bit compare


Problem description

I'm trying to find the most efficient way of performing 8 bit unsigned compares using SSE (up to SSE 4.2).

The most common case I'm working on is comparing for > 0U, e.g.

_mm_cmpgt_epu8(v, _mm_setzero_si128())                // #1

(which of course can also be considered a simple test for non-zero.)

But I'm also somewhat interested in the more general case, e.g.

_mm_cmpgt_epu8(v1, v2)                                // #2

The first case can be implemented with 2 instructions, using various different methods, e.g. compare with 0 and then invert the result. The second case typically requires 3 instructions, e.g. subtract 128 from both operands and perform a signed compare. (See this question for various 3 instruction solutions.)

What I'm looking for ideally is a single instruction solution for #1, and a two instruction solution for #2. If neither of these is possible then I'm also interested in thoughts as to which of the various possible 2 or 3 instruction implementations is most efficient on modern Intel CPUs (Sandy Bridge, Ivy Bridge, Haswell).

Best implementations for case #2 so far:

  1. Compare with the unsigned max and invert the result:

#define _mm_cmpgt_epu8(v0, v1) \
    _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(v0, v1), v1), \
                     _mm_set1_epi8(-1))

Two arithmetic instructions + one bitwise = 1.33 throughput.

  2. Invert the sign bits of both arguments (== subtract 128) and use a signed compare:

#define _mm_cmpgt_epu8(v0, v1) \
    _mm_cmpgt_epi8(_mm_xor_si128(v0, _mm_set1_epi8(-128)), \
                   _mm_xor_si128(v1, _mm_set1_epi8(-128)))

One arithmetic instruction + two bitwise = 1.16 throughput.
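As a sanity check, the two case #2 variants above can be compared against a scalar reference over every pair of byte values. The following is only a sketch: the function names `cmpgt_epu8_max`, `cmpgt_epu8_xor` and `count_cmpgt_mismatches` are made up for illustration (they replace the macros above so both variants can coexist), and it assumes an x86 compiler where SSE2 is enabled and where casting an `int` in 128..255 to `char` wraps as on GCC/Clang/MSVC.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Variant 1: v0 > v1 exactly when max(v0, v1) != v1 (unsigned). */
static inline __m128i cmpgt_epu8_max(__m128i v0, __m128i v1)
{
    return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(v0, v1), v1),
                            _mm_set1_epi8(-1));
}

/* Variant 2: flip the sign bit of both operands, then compare as signed. */
static inline __m128i cmpgt_epu8_xor(__m128i v0, __m128i v1)
{
    const __m128i sign = _mm_set1_epi8((char)-128);
    return _mm_cmpgt_epi8(_mm_xor_si128(v0, sign), _mm_xor_si128(v1, sign));
}

/* Hypothetical helper: counts lanes that disagree with the scalar
 * result (a > b) over all 256 x 256 byte pairs. */
static int count_cmpgt_mismatches(void)
{
    int bad = 0;
    for (int a = 0; a < 256; ++a) {
        for (int b = 0; b < 256; ++b) {
            const __m128i va = _mm_set1_epi8((char)a);
            const __m128i vb = _mm_set1_epi8((char)b);
            const uint8_t expect = (a > b) ? 0xFF : 0x00;
            uint8_t r1[16], r2[16];

            _mm_storeu_si128((__m128i *)r1, cmpgt_epu8_max(va, vb));
            _mm_storeu_si128((__m128i *)r2, cmpgt_epu8_xor(va, vb));

            for (int i = 0; i < 16; ++i)
                bad += (r1[i] != expect) + (r2[i] != expect);
        }
    }
    return bad;
}
```

If both variants are correct, `count_cmpgt_mismatches()` returns 0. This checks correctness only and says nothing about the throughput figures quoted above.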

Best implementations for case #1, derived from case #2 implementations above:

  • 1.

#define _mm_cmpgtz_epu8(v0) \
    _mm_andnot_si128(_mm_cmpeq_epi8(v0, _mm_set1_epi8(0)), \
                     _mm_set1_epi8(-1))

One arithmetic instruction + one bitwise = 0.83 throughput.

  • 2.

#define _mm_cmpgtz_epu8(v0) \
    _mm_cmpgt_epi8(_mm_xor_si128(v0, _mm_set1_epi8(-128)), \
                   _mm_set1_epi8(-128))

One arithmetic instruction + one bitwise = 0.83 throughput.
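The two case #1 variants can be checked the same way, against the scalar test `v != 0` for all 256 byte values. Again this is a sketch under assumptions: `cmpgtz_epu8_eq`, `cmpgtz_epu8_xor` and `count_cmpgtz_mismatches` are illustrative names rather than real intrinsics, and the `(char)` casts assume the usual wrap-around conversion on x86 compilers.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Variant 1: compare with zero, then invert the mask. */
static inline __m128i cmpgtz_epu8_eq(__m128i v)
{
    return _mm_andnot_si128(_mm_cmpeq_epi8(v, _mm_setzero_si128()),
                            _mm_set1_epi8(-1));
}

/* Variant 2: flip the sign bit, then signed compare against -128,
 * which is what unsigned 0 becomes after the same flip. */
static inline __m128i cmpgtz_epu8_xor(__m128i v)
{
    const __m128i sign = _mm_set1_epi8((char)-128);
    return _mm_cmpgt_epi8(_mm_xor_si128(v, sign), sign);
}

/* Hypothetical helper: counts lanes that disagree with (v != 0). */
static int count_cmpgtz_mismatches(void)
{
    int bad = 0;
    for (int a = 0; a < 256; ++a) {
        const __m128i va = _mm_set1_epi8((char)a);
        const uint8_t expect = (a != 0) ? 0xFF : 0x00;
        uint8_t r1[16], r2[16];

        _mm_storeu_si128((__m128i *)r1, cmpgtz_epu8_eq(va));
        _mm_storeu_si128((__m128i *)r2, cmpgtz_epu8_xor(va));

        for (int i = 0; i < 16; ++i)
            bad += (r1[i] != expect) + (r2[i] != expect);
    }
    return bad;
}
```

A return value of 0 from `count_cmpgtz_mismatches()` means both variants agree with the scalar test on every input byte.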

Recommended answer

There is an example from Simd Library:

    const __m128i K_INV_ZERO = SIMD_MM_SET1_EPI8(0xFF);//_mm_set1_epi8(-1);

    SIMD_INLINE __m128i NotEqual8u(__m128i a, __m128i b)
    {
        return _mm_andnot_si128(_mm_cmpeq_epi8(a, b), K_INV_ZERO);
    }

    SIMD_INLINE __m128i Greater8u(__m128i a, __m128i b)
    {
        return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_min_epu8(a, b), a), K_INV_ZERO);
    }

    SIMD_INLINE __m128i GreaterOrEqual8u(__m128i a, __m128i b)
    {
        return _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
    }

    SIMD_INLINE __m128i Lesser8u(__m128i a, __m128i b)
    {
        return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(a, b), a), K_INV_ZERO);
    }

    SIMD_INLINE __m128i LesserOrEqual8u(__m128i a, __m128i b)
    {
        return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
    }
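The excerpt above relies on the library's own `SIMD_INLINE` and `SIMD_MM_SET1_EPI8` macros, so it does not compile on its own. A standalone approximation, using only plain SSE2 intrinsics, might look like the sketch below. The lower-case function names and the `count_compare_mismatches` helper are invented here for illustration and are not part of Simd Library; the all-ones constant is built with `_mm_set1_epi8(-1)`, as the comment in the original excerpt suggests.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* All-ones mask, used to invert a comparison result via andnot. */
static inline __m128i all_ones(void) { return _mm_set1_epi8(-1); }

static inline __m128i not_equal_8u(__m128i a, __m128i b)
{
    return _mm_andnot_si128(_mm_cmpeq_epi8(a, b), all_ones());
}

/* a > b  <=>  !(min(a, b) == a) */
static inline __m128i greater_8u(__m128i a, __m128i b)
{
    return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_min_epu8(a, b), a), all_ones());
}

/* a >= b  <=>  max(a, b) == a  (two instructions, no inversion) */
static inline __m128i greater_or_equal_8u(__m128i a, __m128i b)
{
    return _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
}

/* a < b  <=>  !(max(a, b) == a) */
static inline __m128i lesser_8u(__m128i a, __m128i b)
{
    return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(a, b), a), all_ones());
}

/* a <= b  <=>  min(a, b) == a  (two instructions, no inversion) */
static inline __m128i lesser_or_equal_8u(__m128i a, __m128i b)
{
    return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
}

/* Lowest byte of a mask; inputs below are broadcast, so all lanes match. */
static inline int lane0(__m128i m) { return _mm_cvtsi128_si32(m) & 0xFF; }

/* Hypothetical helper: compares each predicate with its scalar
 * counterpart over all 256 x 256 byte pairs. */
static int count_compare_mismatches(void)
{
    int bad = 0;
    for (int a = 0; a < 256; ++a) {
        for (int b = 0; b < 256; ++b) {
            const __m128i va = _mm_set1_epi8((char)a);
            const __m128i vb = _mm_set1_epi8((char)b);
            bad += lane0(not_equal_8u(va, vb))        != ((a != b) ? 0xFF : 0);
            bad += lane0(greater_8u(va, vb))          != ((a >  b) ? 0xFF : 0);
            bad += lane0(greater_or_equal_8u(va, vb)) != ((a >= b) ? 0xFF : 0);
            bad += lane0(lesser_8u(va, vb))           != ((a <  b) ? 0xFF : 0);
            bad += lane0(lesser_or_equal_8u(va, vb))  != ((a <= b) ? 0xFF : 0);
        }
    }
    return bad;
}
```

Note that the non-strict comparisons (`>=`, `<=`) need only two instructions, so when the surrounding code can work with the inverted mask or swapped operands, a strict unsigned compare can sometimes be expressed through them without the extra `andnot`.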
