AVX2中排序数组的有效稳定和 [英] Efficient stable sum of a sorted array in AVX2

查看:88
本文介绍了AVX2中排序数组的有效稳定和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

请考虑对 double 个数字进行排序(升序)的数组.为了保持数值稳定性,应该对数组求和,就像从头到尾迭代一样,将总和累加到某个变量中.

Consider a sorted (ascending) array of double numbers. For numerical stability the array should be summed up as if iterating it from the beginning till the end, accumulating the sum in some variable.

如何使用AVX2有效地矢量化?

How to vectorize this efficiently with AVX2?

我已经研究了这种方法最快方法使用AVX指令进行水平向量求和,但是将其缩放为数组似乎很棘手(可能需要一些分而治之的方法),同时通过确保对小数求和来保持浮点精度在将它们添加到更大数量之前.

I've looked into this method Fastest way to do horizontal vector sum with AVX instructions , but it seems quite tricky to scale it to an array (some divide&conquer approach may be needed), while keeping the floating-point precision by ensuring that small numbers are summed up before adding them to a larger number.

说明1 :例如,我认为应该没问题对前4个项求和,然后将它们加到接下来的4个项的总和中,依此类推.我愿意为性能而牺牲一些稳定性.但是我更喜欢一种不会完全破坏稳定性的方法.

Clarification 1: I think it should be ok to e.g. sum the first 4 items, then add them to the sum of the next 4 items, etc. I'm willing to trade some stability for performance. But I would prefer a method that doesn't ruin the stability completely.

澄清2 :内存不应该成为瓶颈,因为该阵列位于L3高速缓存中(但不能位于L1/L2高速缓存中,因为该阵列的各个部分是从不同的线程填充的).我不想求助于Kahan求和,因为我认为真正重要的是运算数量,而Kahan求和将使它增加约4倍.

Clarification 2: memory shouldn't be a bottleneck because the array is in L3 cache (but not in L1/L2 cache, because pieces of the array were populated from different threads). I wouldn't like to resort to Kahan summation because I think it's really the number of operations that matters, and Kahan summation would increase it about 4 times.

推荐答案

如果需要精确的并行性,请使用Kahan求和或另一种错误补偿技术来对和进行重新排序(到SIMD向量中)元素具有多个累加器).

If you need precision and parallelism, use Kahan summation or another error-compensation technique to let you reorder your sum (into SIMD vector element strides with multiple accumulators).

双重快速求和-Evgeny Latkin 指出,如果您瓶颈在内存带宽上,误差补偿后的总和不会比最大性能总和慢很多,因为CPU具有大量的计算吞吐量,而这些吞吐量没有用在简单并行的总和中,而这会阻塞内存带宽.

As Twofold fast summation - Evgeny Latkin points out, if you bottleneck on memory bandwidth, an error-compensated sum isn't much slower than a max-performance sum, since the CPU has lots of computation throughput that goes unused in a simply-parallelized sum that bottlenecks on memory bandwidth

另请参阅( kahan sumv avx 的Google搜索结果)

See also (google results for kahan summation avx)

回复:您的想法:按顺序累加4个数字的组将使您隐藏FP添加延迟,并在标量添加吞吐量上遇到瓶颈.

Re: your idea: Summing groups of 4 numbers in-order will let you hide the FP-add latency, and bottleneck on scalar add throughput.

在向量中进行水平求和需要大量的改组,因此这是一个潜在的瓶颈.您可能考虑加载 a0 a1 a2 a3 ,然后改组以获得 a0 + a1 x a2 + a3 x ,然后(a0 + a1)+(a2 + a3).你有一个Ryzen,对不对?最后一步只是 vextractf128 到128b.仍然是3个ADD运算符和3个shuffle运算符,但指令数量少于标量或128b向量.

Doing horizontal sums within vectors takes a lot of shuffling, so it's a potential bottleneck. You might consider loading a0 a1 a2 a3, then shuffling to get a0+a1 x a2+a3 x, then (a0+a1) + (a2+a3). You have a Ryzen, right? The last step is just a vextractf128 down to 128b. That's still 3 total ADD uops, and 3 shuffle uops, but with fewer instructions than scalar or 128b vectors.

您总是会遇到一些舍入错误,但是添加相似大小的数字会将其最小化.

You're always going to get some rounding error, but adding numbers of similar magnitude minimizes it.

另请参见 Simd matmul程序提供了不同的数值结果,了解有关成对求和和简单高效SIMD的一些评论.

See also Simd matmul program gives different numerical results for some comments about Pairwise Summation and simple efficient SIMD.

添加4个连续数字与垂直添加4个SIMD向量之间的差异可能可以忽略不计.SIMD向量使您在数组中(SIMD向量宽度)有很小的跨度.除非阵列增长得非常快,否则它们的幅度仍将基本相同.

The difference between adding 4 contiguous numbers vs. vertically adding 4 SIMD vectors is probably negligible. SIMD vectors give you small strides (of SIMD vector width) in the array. Unless the array grows extremely quickly, they're still going to have basically similar magnitudes.

您不需要水平累加,直到最后仍然可以获得大部分好处.您可以维护1或2个SIMD向量累加器,同时使用更多的SIMD寄存器求和短期(可能是4个或8个SIMD向量),然后再添加到主累加器中.

You don't need to horizontal sum until the very end to still get most of the benefit. You can maintain 1 or 2 SIMD vector accumulators while you use more SIMD registers to sum short runs (of maybe 4 or 8 SIMD vectors) before adding into the main accumulators.

实际上,将您的总数划分为更多的方式(跨SIMD向量元素)意味着它不会增长得那么大.因此,它可以准确地解决您要避免的问题,水平汇总到单个标量累加器实际上会使情况变得更糟,尤其是对于严格排序的数组.

In fact having your total split more ways (across the SIMD vector elements) means it doesn't grow as large. So it helps with exactly the problem you're trying to avoid, and horizontal summing down to a single scalar accumulator actually makes things worse, especially for a strictly sorted array.

对于乱序执行,您不需要太多的tmp累加器即可完成这项工作,并且可以将累加到主累加器中的FP-add延迟隐藏起来.您可以进行几组累加到一个新的 tmp = _mm_load_ps()向量中,并将其添加到总数中,OoO exec将与这些执行重叠.因此,您的主循环不需要很大的展开系数.

With out-of-order execution, you don't need very many tmp accumulators to make this work and hide the FP-add latency of accumulating into the main accumulators. You can do a couple groups of accumulating into a fresh tmp = _mm_load_ps() vector and adding that to the total, and OoO exec will overlap those executions. So you don't need a huge unroll factor for your main loop.

但是它不能太小,您不想在主累加器上添加瓶颈.您想成为FP-add吞吐量的瓶颈.(或者,如果您关心Broadwell/Haswell,而又不完全限制内存带宽,则可以将某些FMA与 1.0 乘数混合使用,以利用该吞吐量.)

But it shouldn't be too small, you don't want to bottleneck on the add into the main accumulator. You want to bottleneck on FP-add throughput. (Or if you care about Broadwell/Haswell and you don't totally bottleneck on memory bandwidth, mix in some FMA with a 1.0 multiplier to take advantage of that throughput.)

例如Skylake SIMD FP加法器具有4个周期的延迟,0.5个周期的吞吐量,因此对于单个累加器中的每个加法,您都需要做至少7个加法,它们是短dep链的一部分.优选地,和/或优选地,具有2个长期累加器,以便在调度中更好地吸收资源冲突中的泡沫.

e.g. Skylake SIMD FP add has 4 cycle latency, 0.5 cycle throughput, so you need to be doing at least 7 adds that are part of a short dep chain for every add into a single accumulator. Preferably more, and/or preferably with 2 long-term accumulators to better absorb bubbles in scheduling from resource conflicts.

请参见 _mm256_fmadd_ps慢于_mm256_mul_ps + _mm256_add_ps?了解有关多个累加器的更多信息.

See _mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps? for more about multiple accumulators.

这篇关于AVX2中排序数组的有效稳定和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆