SSE optimized emulation of 64-bit integers


Question

For a hobby project I'm working on, I need to emulate certain 64-bit integer operations on a x86 CPU, and it needs to be fast.

Currently, I'm doing this via MMX instructions, but that's really a pain to work with, because I have to flush the fp register state all the time (and because most MMX instructions deal with signed integers, and I need unsigned behavior).

So I'm wondering if the SSE/optimization gurus here on SO can come up with a better implementation using SSE.

The operations I need are the following (quite specific) ones:

uint64_t X, Y;

X = 0;
X = 1;
X << 1;
X != Y;
X + 1;
X & 0x1 // get lsb
X | 0x1 // set lsb
X > Y;

Specifically, I don't need general-purpose addition or shifting, for example, just add one and left-shift one. Really, just the exact operations shown here.

Except, of course, on x86, uint64_t is emulated by using two 32-bit scalars, which is slow (and, in my case, simply doesn't work, because I need loads/stores to be atomic, which they won't be when loading/storing two separate registers).

Hence, I need a SIMD solution. Some of these operations are trivial, supported by SSE2 already. Others (!= and <) require a bit more work.

Suggestions? SSE and SSE2 are fine. It'd take some persuasion to permit SSE3, and SSE4 is probably out of the question (a CPU that supports SSE4 is likely to be running in 64-bit mode anyway, so I wouldn't need these workarounds).

Answer

SSE2 has direct support for some 64-bit integer operations:

Set both elements to 0:

__m128i z = _mm_setzero_si128();

Set both elements to 1:

__m128i z = _mm_set1_epi64x(1);      // also works for variables.
__m128i z = _mm_set_epi64x(hi, lo);  // elements can be different

__m128i z = _mm_set_epi32(0,1,0,1);  // if any compilers refuse int64_t in 32-bit mode.  (None of the major ones do.)

Set/load the low 64 bits, zero-extending to __m128i:

// supported even in 32-bit mode, and listed as an intrinsic for MOVQ
// so it should be atomic on aligned integers.
_mm_loadl_epi64((const __m128i*)p);     // movq or movsd 64-bit load

_mm_cvtsi64x_si128(a);      // only ICC, others refuse in 32-bit mode
_mm_loadl_epi64((const __m128i*)&a);  // portable for a value instead of pointer

Things based on _mm_set_epi32 can get compiled into a mess by some compilers, so _mm_loadl_epi64 appears to be the best bet across MSVC and ICC as well as gcc/clang, and should actually be safe for your requirement of atomic 64-bit loads in 32-bit mode. See it on the Godbolt compiler explorer
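As a sketch of that load (and the matching store, which the answer doesn't spell out), assuming a naturally aligned uint64_t; the helper names load_u64/store_u64 are mine:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cassert>

// MOVQ load: reads the 64-bit value in a single access (atomic on x86 when
// the uint64_t is 8-byte aligned) into the low half of an XMM register,
// zeroing the upper half.
static inline __m128i load_u64(const uint64_t* p) {
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));
}

// MOVQ store: writes the low 64 bits back, again as a single access.
static inline void store_u64(uint64_t* p, __m128i v) {
    _mm_storel_epi64(reinterpret_cast<__m128i*>(p), v);
}
```

_mm_storel_epi64 is the SSE2 store counterpart of _mm_loadl_epi64, so the whole value can round-trip through memory without ever being split into two 32-bit halves.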

Vertical add/subtract of each 64-bit integer:

__m128i z = _mm_add_epi64(x,y)
__m128i z = _mm_sub_epi64(x,y)

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_arithmetic.htm#intref_sse2_integer_arithmetic

Left shift:

__m128i z = _mm_slli_epi64(x,i)   // i must be an immediate

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_int_shift.htm

Bitwise operators:

__m128i z = _mm_and_si128(x,y)
__m128i z = _mm_or_si128(x,y)

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_sse2_integer_logical.htm
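Together, the shift and bitwise intrinsics cover the question's "X << 1", "X & 0x1" and "X | 0x1" directly; a minimal sketch (the function names and round-trip helpers are mine):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cassert>

static inline __m128i shl1_u64(__m128i x) {   // X << 1 (psllq, immediate count)
    return _mm_slli_epi64(x, 1);
}
static inline __m128i get_lsb(__m128i x) {    // X & 0x1
    return _mm_and_si128(x, _mm_set_epi32(0, 0, 0, 1));
}
static inline __m128i set_lsb(__m128i x) {    // X | 0x1
    return _mm_or_si128(x, _mm_set_epi32(0, 0, 0, 1));
}

// scalar round-trip helpers for checking results
static inline __m128i from_u64(uint64_t v) {
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&v));
}
static inline uint64_t to_u64(__m128i v) {
    uint64_t r;
    _mm_storel_epi64(reinterpret_cast<__m128i*>(&r), v);
    return r;
}
```

Note that psllq shifts each 64-bit element as a unit, so a bit shifted out of the low dword carries into the high dword automatically, which the two-scalar emulation would have to do by hand.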

SSE doesn't have an increment instruction, so you'll have to add a constant holding 1.
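So "X + 1" becomes a paddq against a pre-built vector constant; a sketch (the hoisted constant and the helper name inc_u64 are mine):

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cassert>

// X + 1: paddq with a constant whose low 64-bit element is 1.
// Built with _mm_set_epi32 so it compiles cleanly everywhere, including
// 32-bit mode; hoist it out of hot loops so it isn't rematerialized.
static inline __m128i inc_u64(__m128i x) {
    const __m128i one = _mm_set_epi32(0, 0, 0, 1);
    return _mm_add_epi64(x, one);
}
```

Because paddq is a true 64-bit add, the carry out of the low dword propagates into the high dword, e.g. incrementing 0xFFFFFFFF yields 0x100000000.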

Comparisons are harder: there's no 64-bit compare until SSE4.1's pcmpeqq (equality) and SSE4.2's pcmpgtq (signed greater-than).

Here's equality:

__m128i t = _mm_cmpeq_epi32(a,b);                       // 32-bit lanewise equality
__m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177));  // 177 = _MM_SHUFFLE(2,3,0,1): swap each dword pair, AND => both halves equal

This will set each 64-bit element to 0xFFFFFFFFFFFFFFFF (aka -1) if they are equal. If you want it as a 0 or 1 in an int, you can pull it out using _mm_cvtsi128_si32() and mask with & 1. (But sometimes you can do total -= cmp_result; instead of converting and masking.)

Less-than: (not fully tested)

a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));        // flip sign bits so signed 32-bit
b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));        // compares act as unsigned compares
__m128i t = _mm_cmplt_epi32(a,b);                       // lanewise a < b
__m128i u = _mm_cmpgt_epi32(a,b);                       // lanewise a > b
__m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));   // 177 swaps dword pairs: z = lo_lt | hi_lt
z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z);       // 245 broadcasts the high dword of u: clear result if a_hi > b_hi

This will set each 64-bit element to 0xFFFFFFFFFFFFFFFF if the corresponding element in a is less than the one in b.

Here are versions of "equals" and "less-than" that return a bool. They return the result of the comparison for the bottom 64-bit integer.

inline bool equals(__m128i a,__m128i b){
    __m128i t = _mm_cmpeq_epi32(a,b);
    __m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177));
    return _mm_cvtsi128_si32(z) & 1;
}
inline bool lessthan(__m128i a,__m128i b){
    a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a,b);
    __m128i u = _mm_cmpgt_epi32(a,b);
    __m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z);
    return _mm_cvtsi128_si32(z) & 1;
}
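A quick sanity check of the unsigned semantics, repeating the two helpers verbatim so the snippet stands alone (the from_u64 loader is mine); the values are chosen so that looking at the low dwords alone would give the wrong answer:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>
#include <cassert>

inline bool equals(__m128i a, __m128i b) {
    __m128i t = _mm_cmpeq_epi32(a, b);
    __m128i z = _mm_and_si128(t, _mm_shuffle_epi32(t, 177));
    return _mm_cvtsi128_si32(z) & 1;
}
inline bool lessthan(__m128i a, __m128i b) {
    a = _mm_xor_si128(a, _mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b, _mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a, b);
    __m128i u = _mm_cmpgt_epi32(a, b);
    __m128i z = _mm_or_si128(t, _mm_shuffle_epi32(t, 177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u, 245), z);
    return _mm_cvtsi128_si32(z) & 1;
}

// MOVQ load of a scalar into the low element (helper for testing)
static inline __m128i from_u64(uint64_t v) {
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&v));
}
```

For 0xFFFFFFFF vs 0x100000000, the low dwords compare as 0xFFFFFFFF > 0, but the high-dword logic correctly decides the 64-bit result.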
