向量化模块化算术 [英] Vectorizing Modular Arithmetic

查看:121
本文介绍了向量化模块化算术的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试编写一些相当快的基于组件的矢量加法代码.我正在使用(相信,有符号)64位整数.

I'm trying to write some reasonably fast component-wise vector addition code. I'm working with (signed, I believe) 64-bit integers.

该功能是

void addRq (int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) {
    for(int i = 0; i < dim; i++) {
        a[i] = (a[i]+b[i])%q; // LINE1
    }
}

我正在IvyBridge(SSE4.2和AVX,但不是AVX2)上使用icc -std=gnu99 -O3(icc,以便以后可以使用SVML)进行编译.

I'm compiling with icc -std=gnu99 -O3 (icc so I can use SVML later) on an IvyBridge (SSE4.2 and AVX, but not AVX2).

我的基准是从LINE1中删除%q.用dim=11221184进行100次(迭代)函数调用需要1.6秒. ICC自动对SSE的代码进行矢量化;很好.

My baseline is removing the %q from LINE1. 100 (iterated) function calls with dim=11221184 takes 1.6 seconds. ICC auto-vectorizes the code for SSE; great.

我真的很想做模块化添加.使用%q,ICC不会自动对代码进行矢量化处理,并且会在11.8秒(!)内运行.即使忽略了先前尝试的自动矢量化,这似乎仍然过多.

I really want to do modular additions though. With the %q, ICC does not auto-vectorize the code, and it runs in 11.8 seconds(!). Even ignoring the auto-vectorization for the previous attempt, this still seems excessive.

由于我没有AVX2,因此使用SSE进行矢量化需要SVML,这也许就是为什么ICC没有自动矢量化的原因.无论如何,这是我对内部循环进行矢量化的尝试:

Since I don't have AVX2, vectorization with SSE requires SVML, which is perhaps why ICC didn't auto-vectorize. At any rate, here's my attempt to vectorize the inner loop:

__m128i qs = _mm_set1_epi64x(q);
for(int i = 0; i < dim; i+=2) {
    __m128i xs = _mm_load_si128((const __m128i*)(a+i));
    __m128i ys = _mm_load_si128((const __m128i*)(b+i));
    __m128i zs = _mm_add_epi64(xs,ys);
    zs = _mm_rem_epi64(zs,qs);
    _mm_store_si128((__m128i*)(a+i),zs);
}

主循环的组装是:

..B3.4:                         # Preds ..B3.2 ..B3.12
    movdqa    (%r12,%r15,8), %xmm0                          #59.22
    movdqa    %xmm8, %xmm1                                  #60.14
    paddq     (%r14,%r15,8), %xmm0                          #59.22
    call      __svml_i64rem2                                #61.9
    movdqa    %xmm0, (%r12,%r15,8)                          #61.36
    addq      $2, %r15                                      #56.30
    cmpq      %r13, %r15                                    #56.24
    jl        ..B3.4        # Prob 82%                      #56.24

因此,代码正在按预期进行矢量化.我知道由于SVML,我可能无法获得2倍的加速,但是代码运行的时间为12.5秒,比根本没有向量化的速度要慢!这真的是最好的吗?

So the code is getting vectorized as expected. I know I might not get a 2x speedup due to SVML, but the code runs in 12.5 seconds, slower than with no vectorization at all! Is this really the best that can be done here?

推荐答案

SSE2和AVX2都没有整数除法指令.英特尔不敢称呼SVML函数内在函数,因为它们中的许多都是复杂的函数,它们映射到多个指令,而不仅仅是几个指令.

Neither SSE2 nor AVX2 have integer division instructions. Intel is disingenuous to call the SVML functions intrinsics since many of them are complicated functions which map to several instructions and not just a few.

有一种方法可以对SSE2或AVX2进行更快的除法(模运算).参见本文改进了不变整数的除法.基本上,您可以预先计算除数,然后进行乘法.预计算除数需要时间,但是对于代码中的dim某个值,它应该会获胜.我在这里 SSE整数除法更详细地描述了此方法? 我还在素数查找器>查找带有SIMD的质数-SSE/AVX

There is a way to do faster division (and modulo) with SSE2 or AVX2. See this paper Improved division by invariant integers. Basically you precompute a divisor and then do multiplication. Precomputing the divisor takes time but for some value of dim in your code it should win out. I described this method in more detail here SSE integer division? I also successfully implemented this method in a prime number finder Finding lists of prime numbers with SIMD - SSE/AVX

Agner Fog在其 Vector类使用该论文中描述的方法.如果您需要一些代码,那将是一个很好的起点,但是您必须将其扩展到64位.

Agner Fog implements 32-bit (but not 64-bit) division in his Vector Class using the method described in that paper. That would be a good place to start if you want some code but you will have to extend it to 64-bit.

基于Mysticial的评论并假设输入已经减少,我为SSE创建了一个版本.如果该版本是在MSVC中编译的,则它必须处于64位模式(如32位模式)不支持_mm_set1_epi64x.可以将其固定为32位模式,但是我不想这样做.

Based on Mysticial's comments and assuming that the inputs are already reduced I produced a version for SSE. If this is compiled in MSVC then it needs to be in 64 bit mode as 32 bit mode does not support _mm_set1_epi64x. This can be fixed for 32 bit mode mode but I don't want to do it.

#ifdef _MSC_VER 
#include <intrin.h>
#endif
#include <nmmintrin.h>                 // SSE4.2
#include <stdint.h>
#include <stdio.h>

void addRq_SSE(int64_t* a, const int64_t* b, const int32_t dim, const int64_t q) {
    __m128i q2 = _mm_set1_epi64x(q);
    __m128i t2 = _mm_sub_epi64(q2,_mm_set1_epi64x(1));
    for(int i = 0; i < dim; i+=2) {
        __m128i a2 = _mm_loadu_si128((__m128i*)&a[i]);
        __m128i b2 = _mm_loadu_si128((__m128i*)&b[i]);
        __m128i c2 = _mm_add_epi64(a2,b2);
        __m128i cmp = _mm_cmpgt_epi64(c2, t2);
        c2 = _mm_sub_epi64(c2, _mm_and_si128(q2,cmp));
        _mm_storeu_si128((__m128i*)&a[i], c2);
    }
}

int main() {
    const int64_t dim = 20;
    int64_t a[dim];
    int64_t b[dim];
    int64_t q = 10;

    for(int i=0; i<dim; i++) {
        a[i] = i%q; b[i] = i%q;
    }
    addRq_SSE(a, b, dim, q);
    for(int i=0; i<dim; i++) {
        printf("%d\n", a[i]);
    }   
}

这篇关于向量化模块化算术的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆