Does multicore computing using R's doParallel package use more memory?

Question

I just tested an elastic net with and without a parallel backend. The call is:

enetGrid <- data.frame(.lambda = 0, .fraction = c(.005))
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
enetTune <- train(x, y, method = "enet", tuneGrid = enetGrid,
                  trControl = ctrl, preProc = NULL)

I ran it without a parallel backend registered (and got the warning message from %dopar% when the train call was finished), and then again with one registered for 7 cores (of 8). The first run took 529 seconds, the second, 313. But the first took 3.3GB memory max (reported by the Sun cluster system), and the second took 22.9GB. I've got 30GB of ram, and the task only gets more complicated from here.

Questions: 1) Is this a general property of parallel computation? I thought they shared memory.... 2) Is there a way around this while still using enet inside train? If doParallel is the problem, are there other architectures that I could use with %dopar%--no, right?

Because I am interested in whether this is the expected result: this is closely related to, but not exactly the same as, this question. I'd be fine closing this and merging my question into that one (or marking that one as a duplicate and pointing to this one, since this has more detail), if that's the consensus:

Extremely high memory consumption of the new doParallel package

Answer

In multithreaded programs, threads share lots of memory. It's primarily the stack that isn't shared between threads. But, to quote Dirk Eddelbuettel, "R is, and will remain, single-threaded", so R parallel packages use processes rather than threads, and so there is much less opportunity to share memory.

However, memory is shared between the processes that are forked by mclapply (as long as the processes don't modify it, which triggers a copy of the memory region in the operating system). That is one reason that the memory footprint can be smaller when using the "multicore" API versus the "snow" API with parallel/doParallel.
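A minimal sketch of that copy-on-write sharing, using only the base `parallel` package (the object names are illustrative; forking is unavailable on Windows, where `mc.cores` must be 1):

```r
library(parallel)

# A large object allocated once in the master process.
big <- rnorm(1e6)

# mclapply() forks the master, so each worker reads 'big' through the
# parent's memory pages. Nothing is duplicated unless a worker writes
# to the object (copy-on-write at the OS level).
res <- mclapply(1:4, function(i) length(big) + i, mc.cores = 2)
```

Each worker returns `length(big) + i` without ever copying the million-element vector.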

In other words, using:

registerDoParallel(7)

may be much more memory efficient than using:

cl <- makeCluster(7)
registerDoParallel(cl)

since the former will cause %dopar% to use mclapply on Linux and Mac OS X, while the latter uses clusterApplyLB.
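The memory cost of the `clusterApplyLB` path can be seen with base `parallel` alone: each "snow"-style (PSOCK) worker is a fresh R process with its own address space, so any data it needs must be serialized and copied into every worker (a sketch; the object names are illustrative):

```r
library(parallel)

big <- rnorm(1e5)

cl <- makeCluster(2)       # two fresh R processes; no shared address space
clusterExport(cl, "big")   # 'big' is serialized and duplicated in each worker
res <- clusterApplyLB(cl, 1:4, function(i) length(big) + i)
stopCluster(cl)
```

With N workers, the cluster holds N extra copies of `big`; with forked workers it would hold none (until a write occurs).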

However, the "snow" API allows you to use multiple machines, and that means that your memory size increases with the number of CPUs. This is a great advantage since it can allow programs to scale. Some programs even get super-linear speedup when running in parallel on a cluster since they have access to more memory.

So to answer your second question, I'd say to use the "multicore" API with doParallel if you only have a single machine and are using Linux or Mac OS X, but use the "snow" API with multiple machines if you're using a cluster. I don't think there is any way to use shared memory packages such as Rdsm with the caret package.
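Following that recommendation, a minimal sketch of the two registration styles (assuming `doParallel` is installed; the host names in the commented-out cluster variant are hypothetical):

```r
library(doParallel)

# Single Linux/macOS machine: the "multicore" (fork) backend.
# Passing a worker count rather than a cluster object makes %dopar%
# dispatch to mclapply, so workers share memory copy-on-write.
registerDoParallel(7)

# Multiple machines: the "snow" backend instead. Total memory then
# grows with the number of nodes, at the cost of copying data to each.
# cl <- makeCluster(c("node1", "node2"))   # hypothetical host names
# registerDoParallel(cl)
```

After either call, `train()` runs its resampling loops through `%dopar%` on whichever backend is registered.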
