Could the "reduce" function be parallelized in Functional Programming?

Question

In Functional Programming, one benefit of the map function is that it could be implemented to be executed in parallel.

So on 4-core hardware, this code with a parallel implementation of map would allow the 4 values to be processed at the same time.

let numbers = [0,1,2,3]
let increasedNumbers = numbers.map { $0 + 1 }

Fine, now let's talk about the reduce function.

Return the result of repeatedly calling combine with an accumulated value initialized to initial and each element of self, in turn, i.e. return combine(combine(...combine(combine(initial, self[0]), self[1]),...self[count-2]), self[count-1]).

My question: could the reduce function be implemented so as to be executed in parallel? Or, by definition, is it something that can only be executed sequentially?

Example:

let sum = numbers.reduce(0) { $0 + $1 }

Solution

One of the most common reductions is the sum of all elements.

((a+b) + c) + d == (a + b) + (c+d)  # associative
a+b == b+a                          # commutative

That equality works for integers, so you can change the order of operations from one long dependency chain to multiple shorter dependency chains, allowing multithreading and SIMD parallelism.
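
To make the multithreading half of that concrete, here's a minimal Swift sketch in the spirit of the question's code (the chunk count, the parallelSum name, and the use of Dispatch's concurrentPerform are my own illustrative choices, not anything from the question). It only computes the right answer because + can be re-associated into independent partial sums:

import Dispatch

// Hypothetical sketch: reduce each chunk independently (one short dependency
// chain per chunk), then combine the partial sums. Reordering the additions
// like this is only valid because + is treated as associative.
func parallelSum(_ numbers: [Double], chunks: Int = 4) -> Double {
    let chunkSize = (numbers.count + chunks - 1) / chunks
    var partials = [Double](repeating: 0, count: chunks)
    partials.withUnsafeMutableBufferPointer { buf in
        DispatchQueue.concurrentPerform(iterations: chunks) { c in
            let start = c * chunkSize
            let end = min(start + chunkSize, numbers.count)
            guard start < end else { return }
            buf[c] = numbers[start..<end].reduce(0, +)   // sequential reduce per chunk
        }
    }
    return partials.reduce(0, +)   // combine the per-chunk results
}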

It's also true for mathematical real numbers, but not for floating point numbers. In many cases, catastrophic cancellation is not expected, so the final result will be close enough to be worth the massive performance gain. For C/C++ compilers, this is one of the optimizations enabled by the -ffast-math option. (There's a -fassociative-math option for just this part of -ffast-math, without the assumptions about lack of infinities and NaNs.)

It's hard to get much SIMD speedup if one wide load can't scoop up multiple useful values. Intel's AVX2 added "gathered" loads, but the overhead is very high. With Haswell, it's typically faster to just use scalar code, but later microarchitectures do have faster gathers. So SIMD reduction is much more effective on arrays, or other data that is stored contiguously.

Modern SIMD hardware works by loading 2 consecutive double-precision floats into a vector register (for example, with 16B vectors like x86's SSE). There is a packed-FP-add instruction that adds the corresponding elements of two vectors. So-called "vertical" vector operations (where the same operation happens between corresponding elements in two vectors) are much cheaper than "horizontal" operations (adding the two doubles in one vector to each other).


So at the asm level, you have a loop that sums all the even-numbered elements into one half of a vector accumulator, and all the odd-numbered elements into the other half. Then one horizontal operation at the end combines them. So even without multithreading, using SIMD requires associative operations (or at least, close enough to associative, like floating point usually is). If there's an approximate pattern in your input, like +1.001, -0.999, the cancellation errors from adding one big positive to one big negative number could be much worse than if each cancellation had happened separately.
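
As a rough illustration of that accumulator pattern, here it is written with the Swift standard library's SIMD4<Double> as a stand-in for a vector register (4 lanes instead of the 2 described above; the simdSum name is mine, and this is a sketch of the shape a compiler produces, not something you'd normally hand-write):

// Vertical adds into one vector accumulator, one horizontal combine at the end.
// Lane 0 ends up holding elements 0, 4, 8, ...; lane 1 holds 1, 5, 9, ...; etc.
func simdSum(_ numbers: [Double]) -> Double {
    var acc = SIMD4<Double>(repeating: 0)
    var i = 0
    while i + 4 <= numbers.count {
        acc += SIMD4(numbers[i], numbers[i+1], numbers[i+2], numbers[i+3])   // vertical add
        i += 4
    }
    var total = acc[0] + acc[1] + acc[2] + acc[3]            // one horizontal combine
    while i < numbers.count { total += numbers[i]; i += 1 }  // leftover tail, done sequentially
    return total
}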

With wider vectors, or narrower elements, a vector accumulator will hold more elements, increasing the benefit of SIMD.

Modern hardware has pipelined execution units that can sustain one (or sometimes two) FP vector-adds per clock, but the result of each one isn't ready for 5 cycles. Saturating the hardware's throughput capabilities requires using multiple accumulators in the loop, so there are 5 or 10 separate loop-carried dependency chains. (To be concrete, Intel Skylake does vector-FP multiply, add, or FMA (fused multiply-add) with 4c latency and one per 0.5c throughput. 4c/0.5c = 8 FP additions in flight at once to saturate Skylake's FP math unit. Each operation can be a 32B vector of eight single-precision floats or four double-precision floats, a 16B vector, or a scalar.) (Keeping multiple operations in flight can speed up scalar stuff, too, but if there's any data-level parallelism available, you can probably vectorize it as well as use multiple accumulators.) See http://agner.org/optimize/ for x86 instruction timings, pipeline descriptions, and asm optimization stuff, but note that everything here applies to ARM with NEON, PPC Altivec, and other SIMD architectures: they all have vector registers and similar vector instructions.
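
Continuing the simdSum sketch from above, the multiple-accumulator idea looks roughly like this (four accumulators and a 16-element unroll are arbitrary illustrative numbers, not tuned for any particular CPU):

// Four independent accumulators = four independent dependency chains, so each
// add only waits on its own accumulator instead of on the previous iteration.
func simdSumUnrolled(_ numbers: [Double]) -> Double {
    var acc0 = SIMD4<Double>(repeating: 0)
    var acc1 = SIMD4<Double>(repeating: 0)
    var acc2 = SIMD4<Double>(repeating: 0)
    var acc3 = SIMD4<Double>(repeating: 0)
    var i = 0
    while i + 16 <= numbers.count {
        acc0 += SIMD4(numbers[i],      numbers[i + 1],  numbers[i + 2],  numbers[i + 3])
        acc1 += SIMD4(numbers[i + 4],  numbers[i + 5],  numbers[i + 6],  numbers[i + 7])
        acc2 += SIMD4(numbers[i + 8],  numbers[i + 9],  numbers[i + 10], numbers[i + 11])
        acc3 += SIMD4(numbers[i + 12], numbers[i + 13], numbers[i + 14], numbers[i + 15])
        i += 16
    }
    let acc = (acc0 + acc1) + (acc2 + acc3)          // fold the accumulators pairwise
    var total = acc[0] + acc[1] + acc[2] + acc[3]    // then one horizontal combine
    while i < numbers.count { total += numbers[i]; i += 1 }
    return total
}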

For a concrete example, here's how gcc 5.3 auto-vectorizes an FP sum reduction. It only uses a single accumulator, so it's missing out on a factor of 8 in throughput for Skylake. clang is a bit more clever and uses two accumulators, but not as many as the loop unroll factor, so it only gets 1/4 of Skylake's max throughput. Note that if you take -ffast-math out of the compile options, the FP loop uses addss (add scalar single) rather than addps (add packed single). The integer loop still auto-vectorizes, because integer math is associative.

In practice, memory bandwidth is the limiting factor most of the time. Haswell and later Intel CPUs can sustain two 32B loads per cycle from L1 cache. In theory, they could sustain that from L2 cache. The shared L3 cache is another story: it's a lot faster than main memory, but its bandwidth is shared by all cores. This makes cache-blocking (aka loop tiling) for L1 or L2 a very important optimization when it can be done cheaply, once you're working with more than 256k of data. Rather than producing and then reducing 10MiB of data, produce in 128k chunks and reduce them while they're still in L2 cache, instead of the producer having to push them to main memory and the reducer having to bring them back in. When working in a higher-level language, your best bet may be to hope that the implementation does this for you. This is what you ideally want to happen in terms of what the CPU actually does, though.
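
A sketch of that chunked produce-then-reduce pattern (the 16384-element block size is just a guess at something that fits comfortably in L2, and produce stands in for whatever generates the data; both names are made up for illustration):

// Instead of materializing all the data and summing it afterwards, produce one
// block at a time and reduce it while it's still hot in cache.
func blockedSum(count: Int, produce: (Int) -> Double) -> Double {
    let blockSize = 16_384                                  // 16384 * 8 bytes = 128 KiB per block
    var buffer = [Double](repeating: 0, count: blockSize)
    var total = 0.0
    var start = 0
    while start < count {
        let n = min(blockSize, count - start)
        for j in 0..<n { buffer[j] = produce(start + j) }   // fill one block
        total = buffer[0..<n].reduce(total, +)              // reduce it before it's evicted
        start += n
    }
    return total
}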

Note that all the SIMD speedup stuff applies within a single thread operating on a contiguous chunk of memory. You (or the compiler for your functional language!) can and should use both techniques, to have multiple threads each saturating the execution units on the core they're running on.
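
Putting the two together, in the same hand-wavy Swift style as the sketches above (again, names and chunk counts are made up for illustration): each thread reduces its own contiguous chunk with a SIMD accumulator, and the per-thread partial sums are combined at the end.

import Dispatch

func parallelSimdSum(_ numbers: [Double], chunks: Int = 4) -> Double {
    let chunkSize = (numbers.count + chunks - 1) / chunks
    var partials = [Double](repeating: 0, count: chunks)
    partials.withUnsafeMutableBufferPointer { buf in
        DispatchQueue.concurrentPerform(iterations: chunks) { c in
            let start = c * chunkSize
            let end = min(start + chunkSize, numbers.count)
            guard start < end else { return }
            var acc = SIMD4<Double>(repeating: 0)            // SIMD within the thread
            var i = start
            while i + 4 <= end {
                acc += SIMD4(numbers[i], numbers[i + 1], numbers[i + 2], numbers[i + 3])
                i += 4
            }
            var partial = acc[0] + acc[1] + acc[2] + acc[3]
            while i < end { partial += numbers[i]; i += 1 }
            buf[c] = partial                                 // one partial sum per thread
        }
    }
    return partials.reduce(0, +)                             // combine across threads
}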


Sorry for the lack of functional-programming in this answer. You may have guessed that I saw this question because of the SIMD tag. :P

I'm not going to try to generalize from addition to other operations. IDK what kind of stuff you functional-programming guys get up to with reductions, but addition or compare (find min/max, count matches) are the ones that get used as SIMD-optimization examples.
