SSE optimized emulation of 64-bit integers


Question

For a hobby project I'm working on, I need to emulate certain 64-bit integer operations on an x86 CPU, and it needs to be fast.

Currently, I'm doing this via MMX instructions, but that's really a pain to work with, because I have to flush the fp register state all the time (and because most MMX instructions deal with signed integers, and I need unsigned behavior).

So I'm wondering if the SSE/optimization gurus here on SO can come up with a better implementation using SSE.

The operations I need are the following (quite specific) ones:

uint64_t X, Y;

X = 0;
X = 1;
X << 1;
X != Y;
X + 1;
X & 0x1 // get lsb
X | 0x1 // set lsb
X > Y;

Specifically, I don't need general-purpose addition or shifting, for example, just add one and left-shift one. Really, just the exact operations shown here.

Except, of course, on x86, uint64_t is emulated by using two 32-bit scalars, which is slow (and, in my case, simply doesn't work, because I need loads/stores to be atomic, which they won't be when loading/storing two separate registers).

Hence, I need a SIMD solution. Some of these operations are trivial, supported by SSE2 already. Others (!= and <) require a bit more work.

Suggestions? SSE and SSE2 are fine. It'd take some persuasion to permit SSE3, and SSE4 is probably out of the question (a CPU which supports SSE4 is likely to run 64-bit anyway, and so I don't need these workarounds).

Answer

SSE2 has direct support for some 64-bit integer operations:

Set both elements to 0:

__m128i z = _mm_setzero_si128();

Set both elements to 1:

__m128i z = _mm_set1_epi64x(1);      // also works for variables.
__m128i z = _mm_set_epi64x(hi, lo);  // elements can be different

__m128i z = _mm_set_epi32(0,1,0,1);  // if any compilers refuse int64_t in 32-bit mode.  (None of the major ones do.)

Set/load the low 64 bits, zero-extended into a __m128i:

// supported even in 32-bit mode, and listed as an intrinsic for MOVQ
// so it should be atomic on aligned integers.
_mm_loadl_epi64((const __m128i*)p);     // movq or movsd 64-bit load

_mm_cvtsi64x_si128(a);      // only ICC, others refuse in 32-bit mode
_mm_loadl_epi64((const __m128i*)&a);  // portable for a value instead of pointer

Things based on _mm_set_epi32 can get compiled into a mess by some compilers, so _mm_loadl_epi64 appears to be the best bet across MSVC and ICC as well as gcc/clang, and should actually be safe for your requirement of atomic 64-bit loads in 32-bit mode. See it on the Godbolt compiler explorer.
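For the atomicity requirement, here is a minimal load/store pair built on that intrinsic (a hedged sketch; the helper names load64/store64 are mine). On an 8-byte-aligned address, MOVQ performs each access as a single 64-bit instruction, unlike a pair of 32-bit scalar moves:

#include <emmintrin.h>  // SSE2
#include <stdint.h>

static inline __m128i load64(const uint64_t *p) {
    return _mm_loadl_epi64((const __m128i*)p);   // movq: 64-bit load, zero-extended
}

static inline void store64(uint64_t *p, __m128i x) {
    _mm_storel_epi64((__m128i*)p, x);            // movq: stores the low 64 bits
}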

Vertical add/subtract on each 64-bit integer:

__m128i z = _mm_add_epi64(x,y)
__m128i z = _mm_sub_epi64(x,y)

Left shift:

__m128i z = _mm_slli_epi64(x,i)   // i must be an immediate

http://software.intel.com/sites/products/documentation/studio/composer/zh-CN/2011/compiler_c/intref_cls/common/intref_sse2_int_shift.htm

Bitwise operators:

__m128i z = _mm_and_si128(x,y)
__m128i z = _mm_or_si128(x,y)

http://software.intel.com/sites/products/documentation/studio/composer/zh-CN/2011/compiler_c/intref_cls/common/intref_sse2_integer_logical.htm

SSE doesn't have increments, so you'll have to use a constant with 1.
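Putting those together, the question's exact operations map one-to-one onto the intrinsics above; a minimal sketch (the function name and the constant are mine):

#include <emmintrin.h>  // SSE2

static inline __m128i demo_ops(__m128i X) {
    const __m128i one = _mm_set_epi32(0, 0, 0, 1);  // 64-bit constant 1 in the low element
    X = _mm_add_epi64(X, one);                      // X + 1
    X = _mm_slli_epi64(X, 1);                       // X << 1
    __m128i lsb = _mm_and_si128(X, one);            // X & 0x1  (get lsb)
    (void)lsb;                                      // mask result, unused in this sketch
    X = _mm_or_si128(X, one);                       // X | 0x1  (set lsb)
    return X;
}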

Comparisons are harder since there's no 64-bit support until SSE4.1 (pcmpeqq) and SSE4.2 (pcmpgtq).
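For reference only, since SSE4 is ruled out in the question, those would each be a single intrinsic (a hedged aside; the wrapper names eq64/gt64 are mine):

#include <smmintrin.h>  // SSE4.1
#include <nmmintrin.h>  // SSE4.2

static inline __m128i eq64(__m128i x, __m128i y) { return _mm_cmpeq_epi64(x, y); }  // pcmpeqq (SSE4.1)
static inline __m128i gt64(__m128i x, __m128i y) { return _mm_cmpgt_epi64(x, y); }  // pcmpgtq (SSE4.2, signed)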

Here's one for equality:

__m128i t = _mm_cmpeq_epi32(a,b);                       // lanewise 32-bit equality
__m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177));  // 177 = 0xB1 swaps each 32-bit pair: both halves must match

This will set each 64-bit element to 0xffffffffffffffff (aka -1) if they are equal. If you want it as a 0 or 1 in an int, you can pull it out using _mm_cvtsi128_si32() and add 1. (But sometimes you can do total -= cmp_result; instead of converting and adding.)
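To illustrate that last trick (a hedged sketch; count_equal is my name): an equal element compares as all-ones (-1), so subtracting the mask increments a per-element counter with no scalar conversion inside the loop:

#include <emmintrin.h>  // SSE2

static inline __m128i count_equal(__m128i total, __m128i a, __m128i b) {
    __m128i t  = _mm_cmpeq_epi32(a, b);
    __m128i eq = _mm_and_si128(t, _mm_shuffle_epi32(t, 177));  // 64-bit equality mask: 0 or -1
    return _mm_sub_epi64(total, eq);                           // total - (-1) adds 1 per equal element
}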

And less-than (not fully tested):

a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));       // flip sign bits so the signed
b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));       //   compares act as unsigned
__m128i t = _mm_cmplt_epi32(a,b);                      // lanewise a < b
__m128i u = _mm_cmpgt_epi32(a,b);                      // lanewise a > b
__m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));  // 177 = 0xB1: less-than in either half
z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z);      // 245 = 0xF5: clear where the high half is greater

This will set each 64-bit element to 0xffffffffffffffff if the corresponding element in a is less than the one in b.

Here are versions of "equals" and "less-than" that return a bool. They return the result of the comparison for the bottom 64-bit integer.

inline bool equals(__m128i a,__m128i b){
    __m128i t = _mm_cmpeq_epi32(a,b);
    __m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177));
    return _mm_cvtsi128_si32(z) & 1;
}
inline bool lessthan(__m128i a,__m128i b){
    a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a,b);
    __m128i u = _mm_cmpgt_epi32(a,b);
    __m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z);
    return _mm_cvtsi128_si32(z) & 1;
}
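A quick usage sketch, assuming the two helpers above are in scope (the test values are mine):

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i x = _mm_set_epi32(0, 0, 1, 0);  // low element = 0x0000000100000000
    __m128i y = _mm_set_epi32(0, 0, 1, 5);  // low element = 0x0000000100000005
    printf("equals:   %d\n", (int)equals(x, y));    // prints 0
    printf("lessthan: %d\n", (int)lessthan(x, y));  // prints 1
    return 0;
}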

