How to write efficient nested functions for parallelization?
Question
I have a data frame with two grouping variables, class and group. For each class, I have a plotting task per group. Typically, class has 2 levels and group has 500 levels.
I'm using the parallel package for parallelization and the mclapply function to iterate over the class and group levels.
I'm wondering what the best way to write my iterations is. I think I have two options:
- Parallelize over the class variable.
- Parallelize over the group variable.
My computer has 3 cores available for the R session; I usually reserve the 4th core for the operating system. My reasoning was that if I parallelize over the class variable, which has only 2 levels, the 3rd core will never be used, so it should be more efficient to parallelize over the group variable and keep all 3 cores busy. I've written a speed test to check which approach is best:
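library(microbenchmark)
library(parallel)

# A = cores for the outer loop over class levels,
# B = cores for the inner loop over group levels
f = function(class, group, A, B) {
  mclapply(seq_len(class), mc.cores = A, function(z) {
    mclapply(seq_len(group), mc.cores = B, function(g) {
      # branch on the current class level z
      ifelse(z == 1, 'plotA', 'plotB')
    })
  })
}

class = 2
group = 500

microbenchmark(
  up = f(class, group, 3, 1),    # parallelize over class
  nest = f(class, group, 1, 3),  # parallelize over group
  times = 50L
)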
Unit: milliseconds
 expr       min        lq     mean    median       uq      max neval
   up  6.751193  7.897118 10.89985  9.769894 12.26880 26.87811    50
 nest 16.584382 18.999863 25.54437 22.293591 28.60268 63.49878    50
The result suggests that I should parallelize over class rather than over group.
The takeaway would be that I should always write single-core functions and then call them in parallel. I think my code would be simpler this way than if I wrote nested functions with built-in parallelization.
The ifelse condition is there because the code that prepares the data for the plotting task is largely the same for both class levels, so I thought it would take fewer lines of code to write one longer function that checks which class level is being processed than to split it into two shorter functions.
What is the best practice for writing this kind of code? It seems clear, but since I'm not an expert data scientist, I would like to know how you approach it.
This thread is about the same problem, but I think my question covers both points of view:
- Code aesthetics and clarity
- Speed performance
Thanks
Recommended answer
You asked this a while ago, but I'll attempt an answer in case anyone else is wondering the same thing. First, I like to split up my task and then loop over each part. This gives me more control over the process.
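# One piece per class/group combination (list(), not c(), defines the interaction)
parts <- split(df, list(df$class, df$group), drop = TRUE)
# some_function stands for your single-core plotting function
mclapply(parts, some_function)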
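For anyone who wants to try the split-then-loop idea end to end, here is a minimal sketch on made-up data. The toy data frame, the body of some_function (it summarises a piece instead of plotting it) and the choice of mc.cores = 2 are all assumptions for illustration, not part of the original question. Forking via mclapply assumes a Unix-like system.

```r
library(parallel)

# Toy data (made up): 2 classes x 5 groups, 4 rows per combination
df <- expand.grid(class = c("a", "b"), group = 1:5, rep = 1:4)
df$value <- seq_len(nrow(df))

# One piece per class/group combination; drop = TRUE skips empty combinations
parts <- split(df, list(df$class, df$group), drop = TRUE)

# Hypothetical single-core worker: summarises one piece instead of plotting it
some_function <- function(piece) {
  data.frame(class = piece$class[1], group = piece$group[1], n = nrow(piece))
}

# A single mclapply call over all pieces (forking requires a Unix-like OS)
res <- mclapply(parts, some_function, mc.cores = 2)
summary_df <- do.call(rbind, res)
```

Each element of parts holds the rows of exactly one class/group pair, so the worker never needs to know about the parallelization around it.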
Second, distributing tasks across multiple cores carries substantial computational overhead and can cancel out any gains you make from parallelizing your script. Here, mclapply splits the job across however many cores you have and forks only once. This is much more efficient than nesting two mclapply loops.
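To compare the two layouts on your own machine, something like the following sketch could serve as a starting point. The names plot_task, nested and flat are hypothetical, the task body is a stand-in for real plotting work, and the 2 x 500 sizes and 3 cores are taken from the question. It assumes a Unix-like system, since mclapply relies on forking.

```r
library(parallel)

# Hypothetical one-core task: branch on the class level, as in the question
plot_task <- function(class_id, group_id) {
  if (class_id == 1) "plotA" else "plotB"
}

n_class <- 2
n_group <- 500

# Nested: a separate mclapply call (and a new set of forks) per class level
nested <- function(cores) {
  lapply(seq_len(n_class), function(z) {
    mclapply(seq_len(n_group), function(g) plot_task(z, g), mc.cores = cores)
  })
}

# Flat: enumerate all class/group pairs and use a single mclapply call
grid <- expand.grid(class_id = seq_len(n_class), group_id = seq_len(n_group))
flat <- function(cores) {
  mclapply(seq_len(nrow(grid)), function(i) {
    plot_task(grid$class_id[i], grid$group_id[i])
  }, mc.cores = cores)
}

res_nested <- nested(3)
res_flat <- flat(3)
```

Wrapping nested(3) and flat(3) in system.time() or microbenchmark() shows how they compare for your workload. With a task this trivial the timing mostly reflects forking overhead, so it is worth repeating the comparison with the real plotting function.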