How to set .libPaths (checkpoint) on workers when running parallel computation in R

Problem description

I use the checkpoint package for reproducible data analysis. Some of the computations take a long time, so I want to run them in parallel. When run in parallel, however, the checkpoint is not set on the workers, so I get the error message "there is no package called xy" (because the package is not installed in my default library directory).

How can I make sure that each worker uses the package versions in the checkpoint folder? I tried to set .libPaths in the foreach code, but this does not seem to work. I would also prefer to set the checkpoint/libPaths once globally rather than in every foreach call.

Another option would be to change the .Rprofile file, but I do not want to do this.

checkpoint::checkpoint("2018-06-01")

library(foreach)
library(doFuture)
library(future)

doFuture::registerDoFuture()
future::plan("multisession")

l <- .libPaths()

# Code to run in parallel does not make much sense of course but I wanted to keep it simple.
res <- foreach::foreach(
  x = unique(iris$Species),
  lib.path = l
) %dopar% {
  .libPaths(lib.path)
  stringr::str_c(x, "_")
}

Error in { : task 2 failed - "there is no package called 'stringr'"

Solution

Author of the future package here.

Passing the library path of the master R process as a global variable libs and setting it on each worker using .libPaths(libs) should be enough:

## Use CRAN checkpoint from 2018-07-24 to get future (>= 1.9.0) [1],
## otherwise the stdout below won't be relayed back to the master
## R process. Setting .libPaths() also works in older versions of
## the future package.
## [1] https://cran.microsoft.com/snapshot/2018-07-24/web/packages/future
checkpoint::checkpoint("2018-07-24")
stopifnot(packageVersion("future") >= "1.9.0")

libs <- .libPaths()
print(libs)
### [1] "/home/hb/.checkpoint/2018-07-24/lib/x86_64-pc-linux-gnu/3.5.1"
### [2] "/home/hb/.checkpoint/R-3.5.1"                                 
### [3] "/usr/lib/R/library"

library(foreach)

doFuture::registerDoFuture()
future::plan("multisession")

res <- foreach::foreach(x = unique(iris$Species)) %dopar% {
  ## Use the same library paths as the master R session
  .libPaths(libs)

  cat(sprintf("Library paths used by worker (PID %d):\n", Sys.getpid()))
  cat(sprintf(" - %s\n", sQuote(.libPaths())))

  stringr::str_c(x, "_")
}

###  - ‘/home/hb/.checkpoint/2018-07-24/lib/x86_64-pc-linux-gnu/3.5.1’
###   - ‘/home/hb/.checkpoint/R-3.5.1’
###   - ‘/usr/lib/R/library’
### Library paths used by worker (PID 9394):
###  - ‘/home/hb/.checkpoint/2018-07-24/lib/x86_64-pc-linux-gnu/3.5.1’
###   - ‘/home/hb/.checkpoint/R-3.5.1’
###   - ‘/usr/lib/R/library’
### Library paths used by worker (PID 9412):
###  - ‘/home/hb/.checkpoint/2018-07-24/lib/x86_64-pc-linux-gnu/3.5.1’
###   - ‘/home/hb/.checkpoint/R-3.5.1’
###   - ‘/usr/lib/R/library’

str(res)
### List of 3
###  $ : chr "setosa_"
###  $ : chr "versicolor_"
###  $ : chr "virginica_"
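
A likely reason the attempt in the question fails at task 2: foreach treats every named argument as an iteration variable, so `lib.path = l` hands each task a single element of the library path vector rather than the whole vector. The sketch below illustrates this with sequential `%do%`; the values in `paths` are hypothetical placeholders, not taken from the question.

```r
library(foreach)

# Hypothetical placeholder paths, standing in for the output of .libPaths()
paths <- c("/checkpoint/lib", "/checkpoint/R-base", "/usr/lib/R/library")

# Named arguments are iterated over in lockstep, so every task
# receives exactly one element of 'paths', not the full vector
per_task <- foreach(x = c("a", "b", "c"), lib.path = paths) %do% {
  lib.path
}

str(per_task)
```

Task 2 would therefore call `.libPaths()` with only the second path, which presumably does not contain stringr. Referring to the vector as a global inside the loop body, as in the answer above, avoids this.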

FYI, it is on future's roadmap to make it easier to pass down the library path(s) to workers.
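
In the meantime, one way to set the library paths once per worker, rather than inside every foreach call, might be to create the cluster yourself and configure it up front. The following is a minimal sketch using only the base parallel package; the worker count of 2 is an arbitrary assumption, and this approach is not part of the original answer.

```r
library(parallel)

# Library paths of the main R session (after checkpoint() has been called)
libs <- .libPaths()

# Start two background R workers (the count is arbitrary here)
cl <- makeCluster(2L)

# Set the library paths once on every worker
invisible(clusterCall(cl, function(paths) .libPaths(paths), libs))

# Check what each worker now reports
worker_libs <- clusterEvalQ(cl, .libPaths())
print(worker_libs)

stopCluster(cl)
```

Such a cluster can then be handed to future with `future::plan(future::cluster, workers = cl)` before running the foreach loop; in that case, call `stopCluster(cl)` only after the parallel work is done.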

My details:

> sessionInfo()
R version 3.5.1 (2018-07-02)   
Platform: x86_64-pc-linux-gnu (64-bit)   
Running under: Ubuntu 18.04.1 LTS   

Matrix products: default   
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1   
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1   

locale:   
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                     LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C          

attached base packages:   
[1] stats     graphics  grDevices utils     datasets  methods   base        

other attached packages:   
[1] foreach_1.4.4   

loaded via a namespace (and not attached):   
[1] drat_0.1.4         compiler_3.5.1     BiocManager_1.30.2 parallel_3.5.1        tools_3.5.1        listenv_0.7.0      doFuture_0.6.0    
[8] codetools_0.2-15   iterators_1.0.10   digest_0.6.15      globals_0.12.1        checkpoint_0.4.5   future_1.9.0 
