R: Running a foreach %dopar% loop on an HPC MPI cluster


Problem description

I got access to an HPC cluster with an MPI partition.

My problem is that, no matter what I try, my code (which runs fine on my PC) doesn't run on the HPC cluster. The code looks like this:

library(tm)
library(qdap)
library(snow)
library(doSNOW)
library(foreach)

cl <- makeCluster(30, type="MPI")
registerDoSNOW(cl)
np <- getDoParWorkers()
np
Base <- "./Files1a/"
files <- list.files(path=Base, pattern="\\.txt")

for (i in 1:length(files)) {
  # ...some definitions and variable generation...

  text <- foreach(k = 1:10, .combine='c') %do% {
    if (file.exists(paste("./Files", k, "a/", files[i], sep="")))
      paste(tolower(readLines(paste("./Files", k, "a/", files[i], sep=""))), collapse=" ")
    else
      ""
  }

  docs <- Corpus(VectorSource(text))

  for (k in 1:10) {
    ID[k] <- paste(files[i], k, sep="_")
  }
  data <- as.data.frame(docs)
  data[["docs"]] <- ID
  rm(docs)
  data <- sentSplit(data, "text")

  frequency <- NULL
  cs <- ceiling(length(POLKEY$x) / getDoParWorkers())
  opt <- list(chunkSize=cs)
  frequency <- foreach(j = 2:length(POLKEY$x), .options.mpi=opt, .combine='cbind') %dopar% ...
  write.csv(frequency, file=paste("./Result/output", i, ".csv", sep=""))
  rm(data, frequency)
}

When I run the batch job, the session gets killed at the time limit, and I receive the following messages after the MPI cluster initialization:

Loading required namespace: Rmpi
--------------------------------------------------------------------------
PMI2 initialized but returned bad values for size and rank.
This is symptomatic of either a failure to use the
"--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
If running under SLURM, try adding "-mpi=pmi2" to your
srun command line. If that doesn't work, or if you are
not running under SLURM, try removing or renaming the
pmi2.h header file so PMI2 support will not automatically
be built, reconfigure and build OMPI, and then try again
with only PMI1 support enabled.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:         ...
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
    30 slaves are spawned successfully. 0 failed.

Unfortunately, it seems that the loop doesn't get through even a single iteration, as no output is returned.

For the sake of completeness, my batch file:

#!/bin/bash -l
#SBATCH --job-name MyR
#SBATCH --output MyR-%j.out
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=6
#SBATCH --mem=24gb
#SBATCH --time=00:30:00

MyRProgram="$HOME/R/hpc_test2.R"

cd $HOME/R

export R_LIBS_USER=$HOME/R/Libs2

# start R with my R program
module load R

time R --vanilla -f $MyRProgram

Does anybody have a suggestion for how to solve this problem? What am I doing wrong?

Thanks in advance for your help!

Answer

Your script is an MPI application, so you need to execute it appropriately via Slurm. The Open MPI FAQ has a section on how to do exactly that:

https://www.open-mpi.org/faq/?category=slurm

The most important point is that your script shouldn't execute R directly, but should execute it via the mpirun command, using something like:

mpirun -np 1 R --vanilla -f $MyRProgram

My guess is that the "PMI2" error is caused by not executing R via mpirun. I don't think the "fork" message indicates a real problem; it happens to me at times. I think it occurs because R calls "fork" when initializing, but it has never caused a problem for me, and I'm not sure why I only get the message occasionally.
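If the fork warning bothers you, the warning text itself names the relevant knob: the mpi_warn_on_fork MCA parameter. As a sketch (not something the answer above tested), it can be set to 0 on the mpirun command line:

mpirun --mca mpi_warn_on_fork 0 -np 1 R --vanilla -f $MyRProgram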

Note that it is very important to tell mpirun to launch only one process, since the remaining workers are spawned later by makeCluster; that is what the mpirun -np 1 option is for. If Open MPI was properly built with Slurm support, it should know where to launch those processes when they are spawned. But if you don't use -np 1, then each of the 30 processes launched via mpirun will spawn 30 workers of its own, making a huge mess.
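Applied to the batch script above, that means changing only its final line, for example:

time mpirun -np 1 R --vanilla -f $MyRProgram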

Finally, I think you should tell makeCluster to spawn only 29 workers: the master process already occupies one of the 30 tasks your job requests (5 nodes × 6 tasks per node), so spawning 30 workers would run a total of 31 MPI processes. Depending on your network configuration, even that much oversubscription can cause problems.

I would create the cluster object as follows:

library(snow)
library(Rmpi)
cl <- makeCluster(mpi.universe.size() - 1, type="MPI")

That's safer, and it makes it easier to keep your R script and your job script in sync with each other.
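One further housekeeping point, not raised in the answer above but standard snow/Rmpi practice: shut the cluster down explicitly at the end of the R script so the MPI job exits cleanly:

stopCluster(cl)  # tear down the spawned workers
mpi.quit()       # terminate MPI and quit R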

