SIMD for float threshold operation


Question

I would like to make some vector computations faster, and I believe SIMD instructions for float comparison and manipulation could help. Here is the operation:

void func(const double* left, const double* right, double* res, const size_t size, const double th, const double drop) {
    for (size_t i = 0; i < size; ++i) {
        res[i] = right[i] >= th ? left[i] : (left[i] - drop);
    }
}

Mainly, it lowers the left value by drop when the right value is below the threshold; otherwise the left value is kept unchanged.

The size is around 128-256 elements (not that big), but the function is called very heavily.

I started with loop unrolling, but it did not gain much performance; maybe some compiler flags are needed.

Could you please suggest some improvements to the code for faster computation?

Recommended answer

Clang already auto-vectorizes this pretty much the way Soonts suggested doing manually. Use __restrict on your pointers so it doesn't need a fallback version that works for overlap between some of the arrays. Without __restrict it still auto-vectorizes, but the fallback bloats the function.

Unfortunately GCC only auto-vectorizes this with -ffast-math. More precisely, it turns out that only -fno-trapping-math is required. That option is generally safe, especially if you aren't using fenv access to unmask any FP exceptions (feenableexcept) or looking at the MXCSR sticky FP exception flags (fetestexcept).

With that option, GCC too will use (v)blendvpd with -march=nehalem or -march=znver1. See it on Godbolt.
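As a minimal sketch of the advice above, here is the question's loop with `__restrict` on all three pointers. The name `func_restrict` and the build lines in the comment are illustrative assumptions; whether a particular compiler version really vectorizes it is best confirmed by reading the generated assembly.

```c
#include <stddef.h>

/* Same operation as the question's func(). __restrict promises the compiler
 * that left, right and res never overlap, so it does not have to emit a
 * fallback path for overlapping arrays.
 *
 * Possible build lines (assumed, adjust to your toolchain):
 *   clang -O3 -march=native file.c
 *   gcc   -O3 -march=native -fno-trapping-math file.c
 */
void func_restrict(const double *__restrict left,
                   const double *__restrict right,
                   double *__restrict res,
                   size_t size, double th, double drop)
{
    for (size_t i = 0; i < size; ++i) {
        res[i] = right[i] >= th ? left[i] : (left[i] - drop);
    }
}
```

Calling it with overlapping arrays would break the promise made by `__restrict`, so this only applies when the caller can guarantee distinct buffers.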

Also, your C function is broken. th and drop are scalar double, but you declare them as const double *

AVX512F would let you do a !(right[i] >= thresh) compare and use the resulting mask for a merge-masked subtract.

Elements where the predicate was true will get left[i] - drop; other elements will keep their left[i] value, because you merge into a vector of left values.

Unfortunately GCC with -march=skylake-avx512 uses a normal vsubpd and then a separate vmovapd zmm2{k1}, zmm5 to blend, which is obviously a missed optimization. The blend destination is already one of the inputs to the SUB.

Using AVX512VL for 256-bit vectors (in case the rest of your program can't efficiently use 512-bit, so you don't suffer reduced turbo clock speeds):

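__m256d left = ...;
__m256d right = ...;
// mask bit is set where !(right >= th), including unordered (NaN) lanes
__mmask8 cmp = _mm256_cmp_pd_mask(right, _mm256_set1_pd(th), _CMP_NGE_UQ);
// merge-masking: masked lanes get left - drop, the others keep left
__m256d res = _mm256_mask_sub_pd(left, cmp, left, _mm256_set1_pd(drop));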

So (besides the loads and store) it's 2 instructions with AVX512F / VL.
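To show how the compare and merge-masked subtract might sit inside a complete loop, here is a sketch with loads, stores and a scalar tail. The function name `func_avx512vl`, the unaligned loads, and the preprocessor guard that falls back to plain C when AVX512F/VL are not enabled are my assumptions, not part of the answer. Building with something like `-march=skylake-avx512` (or `-mavx512f -mavx512vl`) enables the vector path.

```c
#include <stddef.h>
#if defined(__AVX512F__) && defined(__AVX512VL__)
#include <immintrin.h>
#endif

void func_avx512vl(const double *left, const double *right, double *res,
                   size_t size, double th, double drop)
{
    size_t i = 0;
#if defined(__AVX512F__) && defined(__AVX512VL__)
    const __m256d vth   = _mm256_set1_pd(th);
    const __m256d vdrop = _mm256_set1_pd(drop);
    /* 4 doubles per 256-bit vector */
    for (; i + 4 <= size; i += 4) {
        __m256d l = _mm256_loadu_pd(left + i);
        __m256d r = _mm256_loadu_pd(right + i);
        /* mask bit set where !(right >= th), NaN lanes included */
        __mmask8 k = _mm256_cmp_pd_mask(r, vth, _CMP_NGE_UQ);
        /* masked lanes become l - drop, unmasked lanes keep l */
        _mm256_storeu_pd(res + i, _mm256_mask_sub_pd(l, k, l, vdrop));
    }
#endif
    /* scalar tail (and the whole loop when AVX512VL is not enabled) */
    for (; i < size; ++i) {
        res[i] = right[i] >= th ? left[i] : (left[i] - drop);
    }
}
```

The scalar tail uses the same predicate as the vector path, so both treat a NaN in right as "below threshold".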

Without AVX512, there is a formulation that is more efficient with all compilers because it just needs an AND, not a variable blend. So it's significantly better with just SSE2, and also better on most CPUs even when they do support SSE4.1 blendvpd, because that instruction isn't as efficient.

You can subtract 0.0 or drop from left[i] based on the compare result.

Producing 0.0 or a constant based on a compare result is extremely efficient: just an andps instruction. (The bit-pattern for 0.0 is all-zeros, and SIMD compares produce vectors of all-1 or all-0 bits. So AND keeps the old value or zeros it.)
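The compare-then-AND idea can be written by hand with SSE2 intrinsics, roughly as follows. This is a sketch under assumptions: the name `func_sse2`, the unaligned loads and the scalar fallback for non-SSE2 builds are mine; the intrinsics themselves (`_mm_cmpnge_pd`, `_mm_and_pd`, `_mm_sub_pd`) are standard SSE2.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

void func_sse2(const double *left, const double *right, double *res,
               size_t size, double th, double drop)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128d vth   = _mm_set1_pd(th);
    const __m128d vdrop = _mm_set1_pd(drop);
    /* 2 doubles per 128-bit vector */
    for (; i + 2 <= size; i += 2) {
        __m128d l = _mm_loadu_pd(left + i);
        __m128d r = _mm_loadu_pd(right + i);
        /* all-ones lanes where !(right >= th), all-zeros otherwise */
        __m128d mask = _mm_cmpnge_pd(r, vth);
        /* AND keeps drop in masked lanes and produces 0.0 elsewhere */
        __m128d sub = _mm_and_pd(mask, vdrop);
        _mm_storeu_pd(res + i, _mm_sub_pd(l, sub));
    }
#endif
    /* scalar tail, same "subtract 0.0 or drop" formulation */
    for (; i < size; ++i) {
        double sub = right[i] >= th ? 0.0 : drop;
        res[i] = left[i] - sub;
    }
}
```

No blend instruction is needed: the subtraction always happens, and the AND decides whether it subtracts `drop` or `0.0`.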

We can also add -drop instead of subtracting drop. This costs an extra negation on input, but with AVX allows a memory-source operand for vaddpd. GCC chooses to use an indexed addressing mode so that doesn't actually help reduce the front-end uop count on Intel CPUs, though; it will "unlaminate". But even with -ffast-math, gcc doesn't do this optimization on its own to allow folding a load. (It wouldn't be worth doing separate pointer increments unless we unroll the loop, though.)

void func3(const double *__restrict left, const double *__restrict right, double *__restrict res,
  const size_t size, const double th, const double drop)
{
    for (size_t i = 0; i < size; ++i) {
        double add = right[i] >= th ? 0.0 : -drop;
        res[i] = left[i] + add;
    }
}

GCC 9.1's inner loop for func3 (with -march=skylake and without -ffast-math), from the Godbolt link above:

# func3 main loop
# gcc -O3 -march=skylake       (without fast-math)
.L33:
    vcmplepd        ymm2, ymm4, YMMWORD PTR [rsi+rax]
    vandnpd ymm2, ymm2, ymm3
    vaddpd  ymm2, ymm2, YMMWORD PTR [rdi+rax]
    vmovupd YMMWORD PTR [rdx+rax], ymm2
    add     rax, 32
    cmp     r8, rax
    jne     .L33

The plain SSE2 version of func3 has an inner loop that's basically the same as the one you get with left - zero_or_drop instead of left + zero_or_minus_drop. So unless you can promise the compiler 16-byte alignment or you're making an AVX version, negating drop is just extra overhead.

Negating drop takes a constant from memory (to XOR the sign bit), and that's the only static constant this function needs, so that tradeoff is worth considering for your case where the loop doesn't run a huge number of times. (Unless th or drop are also compile-time constants after inlining, and are getting loaded anyway. Or especially if -drop can be computed at compile time. Or if you can get your program to work with a negative drop.)

Another difference between adding and subtracting is that subtracting doesn't destroy the sign of zero. -0.0 - 0.0 = -0.0, +0.0 - 0.0 = +0.0. In case that matters.
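A tiny check of the signed-zero point, assuming IEEE 754 arithmetic in the default round-to-nearest mode and no -ffast-math. The helper names `keep_by_sub` and `keep_by_add` are hypothetical; they model the "element is kept" case of the two formulations.

```c
#include <math.h>

/* "left - zero_or_drop" form for an element that is not dropped */
double keep_by_sub(double left) { return left - 0.0; }

/* "left + zero_or_minus_drop" form for an element that is not dropped */
double keep_by_add(double left) { return left + 0.0; }

/* Returns 1 if the formulation preserves the sign of a -0.0 input. */
int sub_preserves_negative_zero(void) { return signbit(keep_by_sub(-0.0)) != 0; }
int add_preserves_negative_zero(void) { return signbit(keep_by_add(-0.0)) != 0; }
```

Under those assumptions the subtracting form returns -0.0 for a -0.0 input, while the adding form returns +0.0. For any nonzero input both forms return the input unchanged.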

# gcc9.1 -O3
.L26:
    movupd  xmm5, XMMWORD PTR [rsi+rax]
    movapd  xmm2, xmm4                    # duplicate  th
    movupd  xmm6, XMMWORD PTR [rdi+rax]
    cmplepd xmm2, xmm5                    # destroy the copy of th
    andnpd  xmm2, xmm3                    # _mm_andnot_pd
    addpd   xmm2, xmm6                    # _mm_add_pd
    movups  XMMWORD PTR [rdx+rax], xmm2
    add     rax, 16
    cmp     r8, rax
    jne     .L26

GCC uses unaligned loads so (without AVX) it can't fold a memory source operand into cmppd or subpd
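If the buffers really are 16-byte aligned, one way to promise that to GCC or Clang is `__builtin_assume_aligned`. The sketch below is an assumption-laden illustration: the function name `func_aligned16`, the `ASSUME_ALIGNED_16` macro and the caller contract are hypothetical, and whether a given compiler version then folds the loads into memory operands should be verified in the assembly output.

```c
#include <stddef.h>

/* GCC and Clang provide __builtin_assume_aligned; elsewhere this macro is a
 * no-op, so the code stays correct but loses the alignment hint. */
#if defined(__GNUC__)
#define ASSUME_ALIGNED_16(p) __builtin_assume_aligned((p), 16)
#else
#define ASSUME_ALIGNED_16(p) (p)
#endif

/* Caller contract (assumption): left_in, right_in and res_out are all
 * 16-byte aligned and do not overlap. Passing misaligned pointers is
 * undefined behavior once the hint is in effect. */
void func_aligned16(const double *__restrict left_in,
                    const double *__restrict right_in,
                    double *__restrict res_out,
                    size_t size, double th, double drop)
{
    const double *left  = ASSUME_ALIGNED_16(left_in);
    const double *right = ASSUME_ALIGNED_16(right_in);
    double *res         = ASSUME_ALIGNED_16(res_out);

    for (size_t i = 0; i < size; ++i) {
        /* "subtract 0.0 or drop" form, so no negated constant is needed */
        double sub = right[i] >= th ? 0.0 : drop;
        res[i] = left[i] - sub;
    }
}
```

Suitably aligned storage can come from arrays declared with `_Alignas(16)` or from C11 `aligned_alloc`.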
