Better alternative to pmap in Clojure for parallelizing moderately inexpensive functions over big data?
Question
Using Clojure, I have a very large amount of data in a sequence, and I want to process it in parallel on a relatively small number of cores (4 to 8).
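The simplest approach is to use pmap instead of map to map my processing function over the data sequence. In my case, however, the coordination overhead results in a net loss.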
I think the reason is that pmap assumes the function mapped across the data is very costly. Looking at pmap's source code, it appears to construct a future for each element of the sequence in turn, so each invocation of the function occurs on a separate thread (cycling over the number of available cores).
Here is the relevant piece of pmap's source:
(defn pmap
  "Like map, except f is applied in parallel. Semi-lazy in that the
  parallel computation stays ahead of the consumption, but doesn't
  realize the entire result unless required. Only useful for
  computationally intensive functions where the time of f dominates
  the coordination overhead."
  ([f coll]
   (let [n (+ 2 (.. Runtime getRuntime availableProcessors))
         rets (map #(future (f %)) coll)
         step (fn step [[x & xs :as vs] fs]
                (lazy-seq
                 (if-let [s (seq fs)]
                   (cons (deref x) (step xs (rest s)))
                   (map deref vs))))]
     (step rets (drop n rets))))
  ;; multi-collection form of pmap elided
  )
In my case the mapped function is not that expensive, but the sequence is huge (millions of records). I think the cost of creating and dereferencing that many futures is where the parallel gain is lost to overhead.
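One rough way to check this suspicion is to time map and pmap side by side on the same data. The sketch below assumes a hypothetical stand-in function cheap-f and an arbitrary data size; neither comes from the original question, and the timings will vary by machine.

```clojure
;; cheap-f is a hypothetical stand-in for a moderately inexpensive function.
(defn cheap-f [x] (* x x))

;; Arbitrary sample size chosen for illustration only.
(def data (vec (range 200000)))

;; dorun forces the lazy result without retaining it,
;; so each timing covers the full traversal.
(time (dorun (map cheap-f data)))
(time (dorun (pmap cheap-f data)))
```

If the pmap timing is no better than the map timing, the per-element future overhead is probably dominating the work done by the function itself.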
Is my understanding of pmap correct?
Is there a better pattern in Clojure than pmap for this sort of lower-cost but massively repeated processing? I am considering chunking the data sequence somehow and then running threads on larger chunks. Is this a reasonable approach, and what Clojure idioms would work?
Recommended answer
This question: how-to-efficiently-apply-a-medium-weight-function-in-parallel also addresses this problem in a very similar context.
The current best answer is to use partition to break the sequence into chunks, then pmap a map function onto each chunk, then recombine the results, map-reduce style.
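A minimal sketch of that idea might look like the following. The names chunked-pmap and chunk-size are hypothetical and not part of clojure.core. The sketch also swaps in partition-all for partition, on the assumption that a trailing partial chunk should be kept rather than dropped.

```clojure
;; Sketch only: chunked-pmap and chunk-size are hypothetical names.
(defn chunked-pmap
  "Like (map f coll), but applies f to chunks of chunk-size elements in parallel."
  [f chunk-size coll]
  (->> coll
       ;; Split into chunks; partition-all keeps the final, shorter chunk.
       (partition-all chunk-size)
       ;; One future per chunk. doall forces the inner lazy map inside the
       ;; worker thread; without it the work would be deferred to the consumer.
       (pmap (fn [chunk] (doall (map f chunk))))
       ;; Recombine the per-chunk results in their original order.
       (apply concat)))

;; Usage: process a sequence in chunks of 1000 elements.
(def result (chunked-pmap inc 1000 (range 10000)))
```

The chunk size is a tuning knob: larger chunks amortize the cost of each future over more elements, while chunks that are too large leave some cores idle near the end of the sequence.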