Whether to use the detectCores function in R to specify the number of cores for parallel processing?


Question

The help for detectCores() says:

This is not suitable for use directly for the mc.cores argument of mclapply nor specifying the number of cores in makeCluster. First because it may return NA, and second because it does not give the number of allowed cores.

However, I've seen quite a bit of sample code like the following:

library(parallel)
k <- 1000
m <- lapply(1:7, function(X) matrix(rnorm(k^2), nrow=k))

cl <- makeCluster(detectCores() - 1, type = "FORK")
test <- parLapply(cl, m, solve)
stopCluster(cl)

where detectCores() is used to specify the number of cores in makeCluster.

My use cases involve running parallel processing both on my own multicore laptop (OSX) and running it on various multicore servers (Linux). So, I wasn't sure whether there is a better way to specify the number of cores or whether perhaps that advice about not using detectCores was more for package developers where code is meant to run over a wide range of hardware and OS environments.

In summary:

  • Should you use the detectCores function in R to specify the number of cores for parallel processing?
  • What is the distinction between detected and allowed cores, and when is it relevant?

Recommended answer

I think it's perfectly reasonable to use detectCores as a starting point for the number of workers/processes when calling mclapply or makeCluster. However, there are many reasons that you may want or need to start fewer workers, and even some cases where you can reasonably start more.

On some hyperthreaded machines it may not be a good idea to set mc.cores=detectCores(), for example. Or if your script is running on an HPC cluster, you shouldn't use any more resources than the job scheduler has allocated to your job. You also have to be careful in nested parallel situations, as when your code may be executed in parallel by a calling function, or you're executing a multithreaded function in parallel. In general, it's a good idea to run some preliminary benchmarks before starting a long job to determine the best number of workers. I usually monitor the benchmark with top to see if the number of processes and threads makes sense, and to verify that the memory usage is reasonable.
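The considerations above can be sketched as a small helper. This is only an illustration, not part of the parallel package: choose_workers, reserve and max_workers are hypothetical names, and the check of SLURM_CPUS_PER_TASK assumes a Slurm job that requested CPUs per task (other schedulers use other variables). Note also that detectCores(logical = FALSE) is honoured only on some platforms, so it may still count hyperthreads.

```r
library(parallel)

# Hypothetical helper: pick a conservative number of workers.
choose_workers <- function(reserve = 1L, max_workers = Inf) {
  # Assumption: under Slurm, SLURM_CPUS_PER_TASK holds the CPUs allocated
  # to this task. If it is set, never exceed that allocation.
  allocated <- suppressWarnings(
    as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = NA))
  )
  if (!is.na(allocated) && allocated >= 1L) {
    return(as.integer(min(allocated, max_workers)))
  }

  # Prefer physical cores where the platform can report them.
  n <- detectCores(logical = FALSE)
  # detectCores() may return NA, so fall back step by step.
  if (is.na(n)) n <- detectCores()
  if (is.na(n)) n <- 1L

  # Leave `reserve` cores free, but always return at least one worker.
  as.integer(max(1L, min(n - reserve, max_workers)))
}
```

The result can then be passed on, for example as makeCluster(choose_workers()) or mclapply(x, f, mc.cores = choose_workers()). It is a starting point only; a short benchmark with a few different worker counts is still the better guide.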

The advice that you quoted is particularly appropriate for package developers. It's certainly a bad idea for a package developer to always start detectCores() workers when calling mclapply or makeCluster, so it's best to leave the decision up to the end user. At the very least, the package should allow the user to specify the number of workers to start, but arguably detectCores() isn't even a good default value. That's why the default value for mc.cores changed from detectCores() to getOption("mc.cores", 2L) when mclapply was included in the parallel package.
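As a rough sketch of what leaving the decision to the end user can look like in package code: invert_all and its workers argument are hypothetical names made up for this example, and the default simply mirrors the one mclapply uses for mc.cores.

```r
library(parallel)

# Hypothetical package function: the caller controls the worker count.
invert_all <- function(mats, workers = getOption("mc.cores", 2L)) {
  workers <- as.integer(workers)
  stopifnot(length(workers) == 1L, !is.na(workers), workers >= 1L)

  # With a single worker, skip the cluster and run sequentially.
  if (workers == 1L) {
    return(lapply(mats, solve))
  }

  # Default (PSOCK) cluster type, which is available on all platforms.
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl), add = TRUE)  # always shut the workers down
  parLapply(cl, mats, solve)
}
```

An end user who wants more workers can either pass workers explicitly or set options(mc.cores = 4L) once at the top of a script; the package itself never calls detectCores().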

I think the real point of the warning that you quoted is that R functions should not assume that they own the whole machine, or that they are the only function in your script that is using multiple cores. If you call mclapply with mc.cores=detectCores() in a package that you submit to CRAN, I expect your package will be rejected until you change it. But if you're the end user, running a parallel script on your own machine, then it's up to you to decide how many cores the script is allowed to use.
