Better alternative to pmap in Clojure for parallelizing moderately inexpensive functions over big data?

Views: 14 Published: 2021/12/30 21:46:11 Tags: clojure parallel-processing


Problem description

Using Clojure, I have a very large amount of data in a sequence, and I want to process it in parallel on a relatively small number of cores (4 to 8).

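The easiest way is to use pmap instead of map to map my processing function over the sequence of data. In my case, however, the coordination overhead results in a net loss.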

I think the reason is that pmap assumes the function mapped across the data is very costly. Looking at pmap's source code, it appears to construct a future for each element of the sequence in turn, so each invocation of the function occurs on a separate thread (cycling over the number of available cores).

Here is the relevant piece of pmap's source:

(defn pmap
  "Like map, except f is applied in parallel. Semi-lazy in that the
  parallel computation stays ahead of the consumption, but doesn't
  realize the entire result unless required. Only useful for
  computationally intensive functions where the time of f dominates
  the coordination overhead."
  ([f coll]
   (let [n (+ 2 (.. Runtime getRuntime availableProcessors))
         rets (map #(future (f %)) coll)
         step (fn step [[x & xs :as vs] fs]
                (lazy-seq
                 (if-let [s (seq fs)]
                   (cons (deref x) (step xs (rest s)))
                   (map deref vs))))]
     (step rets (drop n rets))))
  ;; multi-collection form of pmap elided
  )

In my case the mapped function is not that expensive, but the sequence is huge (millions of records). I think the cost of creating and dereferencing that many futures is where the parallel gain is lost to overhead.
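One rough way to check this suspicion is to time map against pmap on a cheap function. The sketch below is only an illustration: cheap-fn, elapsed-ms, and the 200000-element vector are hypothetical stand-ins for the real workload, and a single System/nanoTime measurement ignores JIT warm-up, so the numbers should be read as indicative at best.

```clojure
;; Hypothetical cheap function standing in for the real per-record work.
(defn cheap-fn [x]
  (* x x))

(defn elapsed-ms
  "Calls thunk once and returns [result elapsed-milliseconds]."
  [thunk]
  (let [start  (System/nanoTime)
        result (thunk)]
    [result (/ (- (System/nanoTime) start) 1e6)]))

;; Assumed test data; the real sequence has millions of records.
(def data (vec (range 200000)))

;; doall forces the lazy result so the work is included in the timing.
(def serial-run   (elapsed-ms #(doall (map cheap-fn data))))
(def parallel-run (elapsed-ms #(doall (pmap cheap-fn data))))

(println "map: " (second serial-run) "ms")
(println "pmap:" (second parallel-run) "ms")
```

Both runs produce the same values, so any difference in the printed times comes from how the work is scheduled rather than from the work itself.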

Is my understanding of pmap correct?

Is there a better pattern in Clojure than pmap for this sort of lower-cost but massively repeated processing? I am considering chunking the data sequence somehow and then running threads on larger chunks. Is this a reasonable approach, and what Clojure idioms would work?
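The chunking idea could be sketched roughly as follows. This is a minimal illustration rather than a tuned solution: chunked-pmap is a hypothetical helper (not part of clojure.core), and the chunk size is an assumption that would need to be measured against the real data.

```clojure
(defn chunked-pmap
  "Like (map f coll), but splits coll into chunks of chunk-size elements
  and processes the chunks in parallel with pmap, so one future is
  created per chunk instead of one per element.
  Hypothetical helper, not part of clojure.core."
  [f chunk-size coll]
  (->> coll
       (partition-all chunk-size)
       ;; mapv is eager, so each chunk is fully processed inside its future.
       (pmap (fn [chunk] (mapv f chunk)))
       ;; Stitch the per-chunk results back together in the original order.
       (apply concat)))

;; Usage with an assumed chunk size of 1000:
(def chunked-result (chunked-pmap inc 1000 (range 10000)))
```

With a chunk size of 1000, a sequence of 10000 elements is handled by 10 futures rather than 10000, while the result keeps the same order as a plain map. Another standard-library option that may be worth measuring is clojure.core.reducers/fold, which partitions vectors into groups for parallel reduction.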
