How to write efficient nested functions for parallelization?


Problem description

I have a dataframe with two grouping variables, class and group. For each class, I have a plotting task per group. Typically, I have 2 levels of class and 500 levels of group.

I'm using the parallel package for parallelization and the mclapply function to iterate over the class and group levels.

I'm wondering which is the best way to write my iterations. I think I have two options:

  1. Run the parallelization over the class variable.
  2. Run the parallelization over the group variable.

My computer has 3 cores available for the R session, and I usually reserve the 4th core for my operating system. I figured that if I parallelize over the class variable, which has only 2 levels, the 3rd core would never be used, so I thought it would be more efficient to keep all 3 cores busy by parallelizing over the group variable. I've written a speed test to check which way is best:

library(microbenchmark)
library(parallel)

f = function(class, group, A, B) {

  # outer loop over the class levels on A cores,
  # inner loop over the group levels on B cores
  mclapply(seq(class), mc.cores = A, function(z) {
    mclapply(seq(group), mc.cores = B, function(g) {
      # stand-in for the real plotting task, dispatched on the class level z
      ifelse(z == 1, 'plotA', 'plotB')
    })
  })

}

class = 2
group = 500

microbenchmark(
  up = f(class, group, 3, 1),
  nest = f(class, group, 1, 3),
  times = 50L
)

Unit: milliseconds
 expr       min        lq     mean    median       uq      max neval
   up  6.751193  7.897118 10.89985  9.769894 12.26880 26.87811    50
 nest 16.584382 18.999863 25.54437 22.293591 28.60268 63.49878    50

The result tells me that I should parallelize over the class variable and not over the group variable.

The takeaway would be that I should always write single-core functions and then call them for parallelization. I think my code would be simpler and more streamlined that way than writing nested functions with built-in parallelization.
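
As an illustration of that pattern, here is a minimal sketch under the assumptions of this question (plot_one_group is a hypothetical placeholder for the real plotting code): a plain single-core worker handles every group of one class level, and a single mclapply call parallelizes over the class levels.

library(parallel)

# hypothetical placeholder for the real per-group plotting task
plot_one_group <- function(cls, grp) {
  if (cls == 1) 'plotA' else 'plotB'
}

# single-core worker: handles every group of one class level sequentially
plot_one_class <- function(cls, groups) {
  lapply(groups, function(g) plot_one_group(cls, g))
}

# parallel caller: one fork per class level, using up to 3 cores
results <- mclapply(1:2, function(cls) plot_one_class(cls, seq_len(500)), mc.cores = 3)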

The ifelse condition is there because the code that prepares the data for the plotting task is more or less the same for both class levels, so I thought it would be more economical to write one longer function that checks which class level is being processed than to split it into two shorter functions.
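
For what it's worth, a tiny sketch of that trade-off (prepare_data, make_plotA and make_plotB are hypothetical placeholders): one longer function keeps the shared preparation step in a single place instead of duplicating it across two shorter functions.

# hypothetical placeholders standing in for the real preparation/plotting code
prepare_data <- function(dat) dat
make_plotA   <- function(dat) 'plotA'
make_plotB   <- function(dat) 'plotB'

# one longer function: the shared preparation is written once,
# and only the final step branches on the class level
make_plot <- function(cls, dat) {
  prepared <- prepare_data(dat)
  if (cls == 1) make_plotA(prepared) else make_plotB(prepared)
}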

What is the best practice for writing this kind of code? It seems clear to me, but since I'm not an expert data scientist, I would like to know your working approach.

This thread revolves around the same problem, but I think my question concerns both points of view:

  • Code aesthetics and clarity
  • Speed and performance

Thanks

Recommended answer

You asked this a while ago, but I'll attempt an answer in case anyone else is wondering the same thing. First, I like to split up my task and then loop over each part. This gives me more control over the process.

# split the data into one part per class/group combination, then loop over the parts
parts <- split(df, list(df$class, df$group))
mclapply(parts, some_function)

Second, distributing tasks to multiple cores carries considerable computational overhead, which can cancel out any gains you make from parallelizing your script. Here, mclapply splits the job across however many cores you have and performs the fork only once. This is much more efficient than nesting two mclapply loops.
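
A minimal runnable sketch of that approach, using a toy data frame with the dimensions from the question (the plotting body is a hypothetical placeholder):

library(parallel)

# toy data: 2 class levels x 500 group levels
df <- expand.grid(class = 1:2, group = 1:500)

# one part per class/group combination (1000 parts)
parts <- split(df, list(df$class, df$group))

# a single fork distributes all parts over 3 cores
plots <- mclapply(parts, function(part) {
  if (part$class[1] == 1) 'plotA' else 'plotB'
}, mc.cores = 3)

length(plots)  # 1000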

