非恒定的除数量化整数除法的最快速的方法 [英] Fastest method of vectorized integer division by non-constant divisor

查看:138
本文介绍了非恒定的除数量化整数除法的最快速的方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

/ 这个问题我写了一个性能测试的 GCC 4.9.2(MinGW64)估算,多个整数除法的方法是快,如下:

Based on the answers/comments of this question i wrote a performance test with gcc 4.9.2 (MinGW64) to estimate which way of multiple integer division is faster, as following:

#include <emmintrin.h>  // SSE2

static unsigned short x[8] = {0, 55, 2, 62003, 786, 5555, 123, 32111};  // Dividend

__attribute__((noinline)) static void test_div_x86(unsigned i){
    for(; i; --i)
        x[0] /= i,
        x[1] /= i,
        x[2] /= i,
        x[3] /= i,
        x[4] /= i,
        x[5] /= i,
        x[6] /= i,
        x[7] /= i;
}

__attribute__((noinline)) static void test_div_sse(unsigned i){
    for(; i; --i){
        __m128i xmm0 = _mm_loadu_si128((const __m128i*)x);
        __m128 xmm1 = _mm_set1_ps(i);
        _mm_storeu_si128(
            (__m128i*)x,
            _mm_packs_epi32(
                _mm_cvtps_epi32(
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpacklo_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                ),
                _mm_cvtps_epi32(
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpackhi_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                )
            )
        );
    }
}

int main(){
    const unsigned runs = 40000000; // Choose a big number, so the compiler doesn't dare to unroll loops and optimize with constants
    test_div_x86(runs),
    test_div_sse(runs);
    return 0;
}

由GNU GPROF和工具的参数结果。

The results by GNU Gprof and tools parameters.

/*
gcc -O? -msse2 -pg -o test.o -c test.c
g++ -o test test.o -pg
test
gprof test.exe gmon.out
-----------------------------------
        test_div_sse(unsigned int)      test_div_x86(unsigned int)
-O0     2.26s                           1.10s
-O1     1.41s                           1.07s
-O2     0.95s                           1.09s
-O3     0.77s                           1.07s
*/

现在我很困惑,为什么在x86测试勉强得到优化,上证所的测试变得更快,虽然昂贵的转换&安培;从浮点。此外,我想知道有多少结果依赖于编译器和架构。

Now i'm confused why the x86 test barely gets optimized and the SSE test becomes faster though the expensive conversion to & from floating point. Furthermore i'd like to know how much results depend on compilers and architectures.

要总结一下:什么是在最后快:将一个接一个或浮点弯路

To summarize it: what's faster at the end: dividing one-by-one or the floating-point detour?

推荐答案

用相同的标量除以向量的所有元素可以与整数乘法和移位来完成。 libdivide (C / C ++,Zlib许可证)提供了一些内联函数做这个标量(如 INT ),以及由标量除以向量。另请参见 SSE整数除法?(如你在你的问题提)为类似的技术提供近似的结果。它是更有效的,如果相同的标量将被应用到许多载体。 libdivide不说一下,结果是不准确的东西,但我没有调查。

Dividing all elements of a vector by the same scalar can be done with integer multiply and shift. libdivide (C/C++, zlib license) provides some inline functions to do this for scalars (e.g. int), and for dividing vectors by scalars. Also see SSE integer division? (as you mention in your question) for a similar technique giving approximate results. It's more efficient if the same scalar will be applied to lots of vectors. libdivide doesn't say anything about the results being inexact, but I haven't investigated.

回复:您的code:
你必须要小心检查什么编译器实际上产生,给它当一个平凡的循环这样。例如它实际上是加载/存储回RAM每次迭代?或者是保持变量住在寄存器中,只有在最后储存?

re: your code: You have to be careful about checking what the compiler actually produces, when giving it a trivial loop like that. e.g. is it actually loading/storing back to RAM every iteration? Or is it keeping variables live in registers, and only storing at the end?

您基准歪斜赞成整数除法回路,由于矢量分压器不会保持在载体环占用100%,但整数除法器被保持在整型循环占用100%。 (注释中讨论后添加这些段落。在previous答案没有解释关于保持分频器喂食,和依存关系链为多。)

Your benchmark is skewed in favour of the integer-division loop, because the vector divider isn't kept 100% occupied in the vector loop, but the integer divider is kept 100% occupied in the int loop. (These paragraphs were added after the discussion in comments. The previous answer didn't explain as much about keeping the dividers fed, and dependency chains.)

您只需要在你的向量环单一依赖链,所以向量产分第二次结果后闲置几个周期每一次迭代,而转换FP-> SI,包链,解压缩,转换Si-所示> FP发生。你已经设置的东西,所以你的吞吐量是由整个循环携带依赖性链的长度,而不是FP分频器的吞吐量的限制。如果数据在每次迭代是独立的(或至少几个有独立的价值,怎么样,你有对INT回路8阵元),一组值,那么解包/转换和转换/包将与<$重叠C $ C> divps 执行时间为另一个向量。矢量分仅部分流水线,但一切如果完全流水线。

You only have a single dependency chain in your vector loop, so the vector divider sits idle for several cycles every iteration after producing the 2nd result, while the chain of convert fp->si, pack, unpack, convert si->fp happens. You've set things up so your throughput is limited by the length of the entire loop-carried dependency chain, rather than the throughput of the FP dividers. If the data each iteration was independent (or there were at least several independent values, like how you have 8 array elements for the int loop), then the unpack/convert and convert/pack of one set of values would overlap with the divps execution time for another vector. The vector divider is only partially pipelined, but everything else if fully pipelined.

这是吞吐量和延迟之间的差,以及为什么它的事项为流水线乱序执行的CPU。

This is the difference between throughput and latency, and why it matters for a pipelined out-of-order execution CPU.

在code。其他的东西:

Other stuff in your code:

您有 __ M128将xmm1 = _mm_set1_ps(I); 在内部循环。 _set1 与ARG不是编译时间常数通常至少2说明: MOVD pshufd 。在这种情况下,一个int到浮转换,太。加入的载体保持你的循环计数器的浮动矢量版本,你增加1.0 ,效果会更好。 (虽然这可能不是扔了你的速度测试任何进一步的,因为这多余的计算可以用其他的东西重叠。)

You have __m128 xmm1 = _mm_set1_ps(i); in the inner loop. _set1 with an arg that isn't a compile-time constant is usually at least 2 instructions: movd and pshufd. And in this case, an int-to-float conversion, too. Keeping a float-vector version of your loop counter, which you increment by adding a vector of 1.0, would be better. (Although this probably isn't throwing off your speed test any further, because this excess computation can overlap with other stuff.)

零开箱工作正常。 SSE4.1 __ m128i _mm_cvtepi16_epi32(__m128i一)是另一种方式。 pmovsxwd 是相同的速度,但并不需要一个归零的寄存器。

Unpacking with zero works fine. SSE4.1 __m128i _mm_cvtepi16_epi32 (__m128i a) is another way. pmovsxwd is the same speed, but doesn't need a zeroed register.

如果你要转换为FP的鸿沟,你认为只是保持你的数据作为FP一会儿?取决于你的算法如何需要四舍五入的情况发生。

If you're going to convert to FP for divide, have you considered just keeping your data as FP for a while? Depends on your algorithm how you need rounding to happen.

divps (压缩单浮点)是10-13个周期的延迟,以每吞吐量7一个周期,最近英特尔的设计。 DIV / IDIV R16 ((GP中的无符号章)整数除法)是23-26个周期的延迟,每9个或8周​​期吞吐量之一。 DIV 11微指令,所以即使是在其他东西的方式得到发卡/对于一些它会通过管道的时间执行。 ( divps 是一个单UOP)。所以,英特尔CPU是不是真的在设计整数除法要快,但作出FP师的努力。

divps (packed single float) is 10-13 cycle latency, with a throughput of one per 7 cycles, on recent Intel designs. div / idiv r16 ((unsigned) integer divide in GP reg) is 23-26 cycle latency, with one per 9 or 8 cycle throughput. div is 11 uops, so it even gets in the way of other things issuing / executing for some of the time it's going through the pipeline. (divps is a single uop.) So, Intel CPUs are not really designed to be fast at integer division, but make an effort for FP division.

所以才单独划分,一个单一的整数除法比矢量FP分裂慢。你会甚至与转换到/从浮动,而解包/包顺理成章了。

So just for the division alone, a single integer division is slower than a vector FP division. You're going to come out ahead even with the conversion to/from float, and the unpack/pack.

如果你能做到在矢量暂存器其他整数OPS,这将是理想的。否则,你必须得到整数进/出载体的REG。如果整数在RAM中,载体负载的罚款。如果你生成它们一次一个, PINSRW 是一种选择,但它可能只是存储到内存中建立了一个载体负载将是一个更快的方法加载全矢量。类似的用于获取数据退了出来,用 PEXTRW 或存储到RAM中。如果你想在GP寄存器的值,跳过转换回诠释后,刚 MOVD / PEXTRD 从这两个向量取其​​暂存器你的价值是,插入/提取指令需要在英特尔2微指令,这意味着他们采取了两个插槽,相比之下,服用唯一一个融合域UOP大多数指令。

If you can do the other integer ops in vector regs, that would be ideal. Otherwise you have to get the integers into / out of vector regs. If the ints are in RAM, a vector load is fine. If you're generating them one at a time, PINSRW is an option, but it's possible that just storing to memory to set up for a vector load would be a faster way to load a full vector. Similar for getting the data back out, with PEXTRW or by storing to RAM. If you want the values in GP registers, skip the pack after converting back to int, and just MOVD / PEXTRD from whichever of the two vector regs your value is in. insert/extract instructions take two uops on Intel, which means they take up two "slots", compared to most instructions taking only one fused-domain uop.

您时序结果,显示出标量code不能与编译器优化改进,是因为CPU可以重叠的其他元素的冗长非优化的负载/存储指令而除法单元是瓶颈。另一方面向量循环仅具有一个或两个依赖链,与每次迭代依赖于previous,添加延迟,所以额外的指令不能与任何重叠。与 -O0 测试是pretty很多从未有用的。

Your timing results, showing that the scalar code doesn't improve with compiler optimizations, is because the CPU can overlap the verbose non-optimized load/store instructions for other elements while the divide unit is the bottleneck. The vector loop on the other hand only has one or two dependency chains, with every iteration dependent on the previous, so extra instructions adding latency can't be overlapped with anything. Testing with -O0 is pretty much never useful.

这篇关于非恒定的除数量化整数除法的最快速的方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆