SSE optimized emulation of 64-bit integers


Question

For a hobby project I'm working on, I need to emulate certain 64-bit integer operations on an x86 CPU, and it needs to be fast.

Currently, I'm doing this via MMX instructions, but that's really a pain to work with, because I have to flush the fp register state all the time (and because most MMX instructions deal with signed integers, and I need unsigned behavior).

So I'm wondering if the SSE/optimization gurus here on SO can come up with a better implementation using SSE.

The operations I need are the following (quite specific) ones:

uint64_t X, Y;

X = 0;
X = 1;
X << 1;
X != Y;
X + 1;
X & 0x1 // get lsb
X | 0x1 // set lsb
X > Y;

Specifically, I don't need general-purpose addition or shifting, for example, just add one and left-shift one. Really, just the exact operations shown here.

Except, of course, on x86, uint64_t is emulated by using two 32-bit scalars, which is slow (and, in my case, simply doesn't work, because I need loads/stores to be atomic, which they won't be when loading/storing two separate registers).

Hence, I need a SIMD solution. Some of these operations are trivial, supported by SSE2 already. Others (!= and <) require a bit more work.

Suggestions? SSE and SSE2 are fine. It'd take some persuasion to permit SSE3, and SSE4 is probably out of the question (a CPU which supports SSE4 is likely to run 64-bit anyway, and so I don't need these workarounds).

Answer

SSE2 has direct support for some 64-bit integer operations:

Set both elements to 0:

__m128i z = _mm_setzero_si128();

Set both elements to 1:

__m128i z = _mm_set1_epi64x(1);      // also works for variables.
__m128i z = _mm_set_epi64x(hi, lo);  // elements can be different

__m128i z = _mm_set_epi32(0,1,0,1);  // if any compilers refuse int64_t in 32-bit mode.  (None of the major ones do.)

Set/load the low 64 bits, zero-extended into a __m128i:

// supported even in 32-bit mode, and listed as an intrinsic for MOVQ
// so it should be atomic on aligned integers.
_mm_loadl_epi64((const __m128i*)p);     // movq or movsd 64-bit load

_mm_cvtsi64x_si128(a);      // only ICC, others refuse in 32-bit mode
_mm_loadl_epi64((const __m128i*)&a);  // portable for a value instead of pointer

Things based on _mm_set_epi32 can get compiled into a mess by some compilers, so _mm_loadl_epi64 appears to be the best bet across MSVC and ICC as well as gcc/clang, and should actually be safe for your requirement of atomic 64-bit loads in 32-bit mode. See it on the Godbolt compiler explorer.
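For the atomicity requirement, here is a minimal load/store pair built on that intrinsic (a hedged sketch; the helper names load64/store64 are mine). On an 8-byte-aligned address, MOVQ performs each access as a single 64-bit instruction, unlike a pair of 32-bit scalar moves:

#include <emmintrin.h>  // SSE2
#include <stdint.h>

static inline __m128i load64(const uint64_t *p) {
    return _mm_loadl_epi64((const __m128i*)p);   // movq: 64-bit load, zero-extended
}

static inline void store64(uint64_t *p, __m128i x) {
    _mm_storel_epi64((__m128i*)p, x);            // movq: stores the low 64 bits
}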

Vertical add/subtract on each 64-bit integer:

__m128i z = _mm_add_epi64(x,y)
__m128i z = _mm_sub_epi64(x,y)

Left shift:

__m128i z = _mm_slli_epi64(x,i)   // i must be an immediate

http://software.intel.com/sites/products/documentation/studio/composer/zh-CN/2011/compiler_c/intref_cls/common/intref_sse2_int_shift.htm

Bitwise operators:

__m128i z = _mm_and_si128(x,y)
__m128i z = _mm_or_si128(x,y)

http://software.intel.com/sites/products/documentation/studio/composer/zh-CN/2011/compiler_c/intref_cls/common/intref_sse2_integer_logical.htm

SSE doesn't have increments, so you'll have to use a constant with 1.
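Putting those together, the question's exact operations map one-to-one onto the intrinsics above; a minimal sketch (the function name and the constant are mine):

#include <emmintrin.h>  // SSE2

static inline __m128i demo_ops(__m128i X) {
    const __m128i one = _mm_set_epi32(0, 0, 0, 1);  // 64-bit constant 1 in the low element
    X = _mm_add_epi64(X, one);                      // X + 1
    X = _mm_slli_epi64(X, 1);                       // X << 1
    __m128i lsb = _mm_and_si128(X, one);            // X & 0x1  (get lsb)
    (void)lsb;                                      // mask result, unused in this sketch
    X = _mm_or_si128(X, one);                       // X | 0x1  (set lsb)
    return X;
}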

Comparisons are harder since there's no 64-bit support until SSE4.1 (pcmpeqq) and SSE4.2 (pcmpgtq).
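For reference only, since SSE4 is ruled out in the question, those would each be a single intrinsic (a hedged aside; the wrapper names eq64/gt64 are mine):

#include <smmintrin.h>  // SSE4.1
#include <nmmintrin.h>  // SSE4.2

static inline __m128i eq64(__m128i x, __m128i y) { return _mm_cmpeq_epi64(x, y); }  // pcmpeqq (SSE4.1)
static inline __m128i gt64(__m128i x, __m128i y) { return _mm_cmpgt_epi64(x, y); }  // pcmpgtq (SSE4.2, signed)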

Here's one for equality:

__m128i t = _mm_cmpeq_epi32(a,b);                       // lanewise 32-bit equality
__m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177));  // 177 = 0xB1 swaps each 32-bit pair: both halves must match

This will set each 64-bit element to 0xffffffffffffffff (aka -1) if they are equal. If you want it as a 0 or 1 in an int, you can pull it out using _mm_cvtsi128_si32() and add 1. (But sometimes you can do total -= cmp_result; instead of converting and adding.)
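To illustrate that last trick (a hedged sketch; count_equal is my name): an equal element compares as all-ones (-1), so subtracting the mask increments a per-element counter with no scalar conversion inside the loop:

#include <emmintrin.h>  // SSE2

static inline __m128i count_equal(__m128i total, __m128i a, __m128i b) {
    __m128i t  = _mm_cmpeq_epi32(a, b);
    __m128i eq = _mm_and_si128(t, _mm_shuffle_epi32(t, 177));  // 64-bit equality mask: 0 or -1
    return _mm_sub_epi64(total, eq);                           // total - (-1) adds 1 per equal element
}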

And less-than (not fully tested):

a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));       // flip sign bits so the signed
b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));       //   compares act as unsigned
__m128i t = _mm_cmplt_epi32(a,b);                      // lanewise a < b
__m128i u = _mm_cmpgt_epi32(a,b);                      // lanewise a > b
__m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));  // 177 = 0xB1: less-than in either half
z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z);      // 245 = 0xF5: clear where the high half is greater

This will set each 64-bit element to 0xffffffffffffffff if the corresponding element in a is less than the one in b.

Here are versions of "equals" and "less-than" that return a bool. They return the result of the comparison for the bottom 64-bit integer.

inline bool equals(__m128i a,__m128i b){
    __m128i t = _mm_cmpeq_epi32(a,b);
    __m128i z = _mm_and_si128(t,_mm_shuffle_epi32(t,177));
    return _mm_cvtsi128_si32(z) & 1;
}
inline bool lessthan(__m128i a,__m128i b){
    a = _mm_xor_si128(a,_mm_set1_epi32(0x80000000));
    b = _mm_xor_si128(b,_mm_set1_epi32(0x80000000));
    __m128i t = _mm_cmplt_epi32(a,b);
    __m128i u = _mm_cmpgt_epi32(a,b);
    __m128i z = _mm_or_si128(t,_mm_shuffle_epi32(t,177));
    z = _mm_andnot_si128(_mm_shuffle_epi32(u,245),z);
    return _mm_cvtsi128_si32(z) & 1;
}
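A quick usage sketch, assuming the two helpers above are in scope (the test values are mine):

#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128i x = _mm_set_epi32(0, 0, 1, 0);  // low element = 0x0000000100000000
    __m128i y = _mm_set_epi32(0, 0, 1, 5);  // low element = 0x0000000100000005
    printf("equals:   %d\n", (int)equals(x, y));    // prints 0
    printf("lessthan: %d\n", (int)lessthan(x, y));  // prints 1
    return 0;
}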

