Practical BigNum AVX/SSE possible?

Question

SSE/AVX registers could be viewed as integer or floating-point BigNums. That is, one could ignore the fact that lanes exist at all. Is there an easy way to exploit this point of view and use these registers as BigNums, either singly or combined? I ask because, from what little I've seen of BigNum libraries, they almost universally store and do arithmetic on arrays, not on SSE/AVX registers. Is portability the reason?

Example:

Say you store the contents of an SSE register as a key in a std::set; you could then compare these contents as a BigNum.

Recommended answer

I think it may be possible to implement BigNum efficiently with SIMD, but not in the way you suggest.

Instead of implementing a single BigNum using a SIMD register (or an array of SIMD registers), you should process multiple BigNums at once.

Let's consider 128-bit addition. Let a 128-bit integer be defined by a pair of low and high 64-bit values, and assume we want to add a 128-bit integer (y_low, y_high) to a 128-bit integer (x_low, x_high). With scalar 64-bit registers this requires only two instructions:

add rax, rdi // x_low  += y_low;
adc rdx, rsi // x_high += y_high + (x_low < y_low);
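The same add-with-carry logic can be sketched in portable C. This is only an illustration: the type `u128_t` and the function `u128_add` are hypothetical names, not part of the original answer, and a compiler may or may not turn this into the add/adc pair above.

```c
#include <stdint.h>

/* Hypothetical type for illustration: a 128-bit unsigned integer
   held as two 64-bit halves. */
typedef struct {
    uint64_t low;
    uint64_t high;
} u128_t;

/* Returns x + y modulo 2^128. */
static u128_t u128_add(u128_t x, u128_t y) {
    u128_t r;
    r.low = x.low + y.low;            /* wraps modulo 2^64 on overflow */
    uint64_t carry = r.low < y.low;   /* 1 if the low addition wrapped, else 0 */
    r.high = x.high + y.high + carry; /* what adc does in a single instruction */
    return r;
}
```

The carry is recovered from an unsigned comparison because C has no direct access to the carry flag; that same comparison is what the SIMD versions below have to compute explicitly.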

With SSE/AVX the problem, as others have explained, is that there is no SIMD carry flag. The carry has to be computed and then added. This requires a 64-bit unsigned comparison. The only realistic option for this with SSE is the AMD XOP instruction vpcomgtuq:

vpaddq      xmm2, xmm0, xmm2 // x_low  += y_low;
vpcomgtuq   xmm0, xmm0, xmm2 // x_low  <  y_low
vpaddq      xmm1, xmm1, xmm3 // x_high += y_high
vpsubq      xmm0, xmm1, xmm0 // x_high += xmm0

This uses four instructions to add two pairs of 128-bit numbers. With scalar 64-bit registers this requires four instructions as well (two add and two adc).

With AVX2 we can add four pairs of 128-bit numbers at once. But XOP has no 256-bit-wide 64-bit unsigned comparison instruction. Instead, we can compute the unsigned comparison a > b as follows (swap the operands to get a < b):

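// Requires <immintrin.h> and AVX2; a and b are __m256i vectors of four unsigned 64-bit lanes.
__m256i sign64 = _mm256_set1_epi64x((long long)0x8000000000000000ULL); // sign bit of every lane
__m256i aflip = _mm256_xor_si256(a, sign64); // flipping the sign bit maps unsigned order onto signed order
__m256i bflip = _mm256_xor_si256(b, sign64);
__m256i cmp = _mm256_cmpgt_epi64(aflip, bflip); // all ones in each lane where a > b (unsigned)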

The sign64 register can be precomputed, so only three instructions are really necessary. Therefore, adding four pairs of 128-bit numbers with AVX2 can be done with six instructions

vpaddq
vpaddq
vpxor
vpxor
vpcmpgtq 
vpsubq

whereas the scalar registers need eight instructions.
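To make the six-step sequence concrete, here is a scalar C emulation of what each 64-bit lane does. It is a sketch under stated assumptions, not AVX2 code: the names `mask_gt_u64` and `add128_x4` are hypothetical, the loop stands in for one 256-bit register operation, and the cast from uint64_t to int64_t assumes the usual two's-complement behaviour.

```c
#include <stdint.h>

#define LANES 4 /* one 256-bit register holds four 64-bit lanes */

/* Unsigned a > b computed with a signed compare only, mirroring
   vpxor + vpxor + vpcmpgtq. Returns an all-ones mask when true,
   zero otherwise, like the SIMD compare does per lane. */
static uint64_t mask_gt_u64(uint64_t a, uint64_t b) {
    const uint64_t sign64 = UINT64_C(0x8000000000000000);
    int64_t aflip = (int64_t)(a ^ sign64); /* assumes two's complement */
    int64_t bflip = (int64_t)(b ^ sign64);
    return aflip > bflip ? UINT64_MAX : 0;
}

/* x += y for four 128-bit integers whose low and high halves are
   kept in separate arrays (one array per register). */
static void add128_x4(uint64_t x_low[LANES], uint64_t x_high[LANES],
                      const uint64_t y_low[LANES],
                      const uint64_t y_high[LANES]) {
    for (int i = 0; i < LANES; i++) {
        uint64_t sum_low = x_low[i] + y_low[i];          /* vpaddq */
        uint64_t carry = mask_gt_u64(y_low[i], sum_low); /* vpxor, vpxor, vpcmpgtq */
        uint64_t sum_high = x_high[i] + y_high[i];       /* vpaddq */
        x_low[i] = sum_low;
        x_high[i] = sum_high - carry; /* vpsubq: subtracting -1 adds the carry */
    }
}
```

The compare yields -1 (all ones) rather than 1 in lanes that carried, which is why the last step is a subtraction instead of an addition.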

AVX-512 has a single instruction, vpcmpuq, for 64-bit unsigned comparison. Therefore, it should be possible to add eight pairs of 128-bit numbers using only four instructions

vpaddq
vpaddq
vpcmpuq
vpsubq

With scalar registers it would require 16 instructions to add eight pairs of 128-bit numbers.
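One detail worth keeping in mind is that vpcmpuq writes its result to a mask register (one bit per lane) rather than to a vector of all-ones lanes, so the final step would presumably be a masked operation. The scalar C emulation below sketches that variant; it is an assumption about how the four instructions could be arranged, and the name `add128_x8` and the use of a `uint8_t` as a stand-in for the mask register are illustrative only.

```c
#include <stdint.h>

#define LANES 8 /* one 512-bit register holds eight 64-bit lanes */

/* x += y for eight 128-bit integers whose low and high halves are
   kept in separate arrays. Each loop stands in for one instruction. */
static void add128_x8(uint64_t x_low[LANES], uint64_t x_high[LANES],
                      const uint64_t y_low[LANES],
                      const uint64_t y_high[LANES]) {
    uint8_t k = 0; /* stand-in for an AVX-512 mask register */
    int i;

    for (i = 0; i < LANES; i++) /* vpaddq */
        x_low[i] += y_low[i];

    for (i = 0; i < LANES; i++) /* vpcmpuq: one mask bit per lane */
        if (x_low[i] < y_low[i])
            k |= (uint8_t)(1u << i);

    for (i = 0; i < LANES; i++) /* vpaddq */
        x_high[i] += y_high[i];

    for (i = 0; i < LANES; i++) /* masked vpsubq of an all-ones vector */
        if ((k >> i) & 1u)
            x_high[i] -= UINT64_MAX; /* subtracting -1 adds the carry */
}
```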

Here is a table summarizing the number of SIMD instructions (nSIMD) and the number of scalar instructions (nscalar) needed to add a given number of pairs (npairs) of 128-bit numbers:

              nSIMD      nscalar     npairs
SSE2 + XOP        4           4           2
AVX2              6           8           4
AVX2 + XOP2       4           8           4
AVX-512           4          16           8

Note that XOP2 does not exist yet; I am only speculating that it may exist at some point.

Note also that to do this efficiently, the BigNum arrays need to be stored in array of structs of arrays (AoSoA) form. For example, using l for the low 64 bits and h for the high 64 bits, an array of 128-bit integers stored as an array of structs like this

lhlhlhlhlhlhlhlh

should instead be stored using an AoSoA like this

SSE2:   llhhllhhllhhllhh
AVX2:   llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh
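As an illustration of the layouts above, the following C sketch repacks an array of (low, high) pairs into AoSoA blocks. The function name `aos_to_aosoa` and its parameters are hypothetical; `lanes` would be 2 for SSE2, 4 for AVX2 and 8 for AVX-512, and the sketch assumes `count` is a multiple of `lanes`.

```c
#include <stddef.h>
#include <stdint.h>

/* Repack count 128-bit integers from l h l h ... order into blocks of
   `lanes` low halves followed by `lanes` high halves.
   aos and aosoa each hold 2 * count uint64_t values and must not overlap.
   Assumes count is a multiple of lanes. */
static void aos_to_aosoa(const uint64_t *aos, uint64_t *aosoa,
                         size_t count, size_t lanes) {
    for (size_t block = 0; block < count / lanes; block++) {
        const uint64_t *src = aos + block * 2 * lanes;
        uint64_t *dst = aosoa + block * 2 * lanes;
        for (size_t i = 0; i < lanes; i++) {
            dst[i] = src[2 * i];             /* low half of element i */
            dst[lanes + i] = src[2 * i + 1]; /* high half of element i */
        }
    }
}
```

With this layout, one aligned load fetches the low halves of `lanes` different numbers and the next load fetches their high halves, which is the shape the vpaddq sequences expect.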
