Writing to global environment when running in parallel
Question
I have a data.frame of cells, values and coordinates. It resides in the global environment.
> head(cont.values)
cell value x y
1 11117 NA -34 322
2 11118 NA -30 322
3 11119 NA -26 322
4 11120 NA -22 322
5 11121 NA -18 322
6 11122 NA -14 322
Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. My solution below tries to avoid that. Each cell can be calculated independently, screaming for parallel execution.
What my function actually does is check whether there's a value for a specified cell number; if it's NA, it calculates the value and inserts it in place of the NA.
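As a sketch of that sequential fill-in step (using a hypothetical stand-in `magic()` for the real slow function, and toy data in the same shape as `cont.values`):

```r
# Toy data in the same shape as cont.values above
cont.values <- data.frame(cell = 11117:11119, value = c(NA, 5, NA),
                          x = c(-34, -30, -26), y = 322)

# Hypothetical stand-in for the real slow per-cell computation
magic <- function(x, y) { Sys.sleep(0.1); x + y }

# Compute a cell's value only if it is still NA, writing back to the
# global copy of cont.values with <<-
fill.cell <- function(i) {
    if (is.na(cont.values$value[i]))
        cont.values$value[i] <<- magic(cont.values$x[i], cont.values$y[i])
    cont.values$value[i]
}

res <- sapply(seq_len(nrow(cont.values)), fill.cell)
```

This works sequentially because everything runs in one process; under snowfall each worker gets its own copy of the global environment, so the `<<-` writes are lost, which is exactly the problem described below.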
I can run my magic function (the result is value for a corresponding cell) using the apply family of functions and, from within apply, I can read and write cont.values without a problem (it's in the global environment).
Now, I want to run this in parallel (using snowfall) and I'm unable to read or write from/to this variable from an individual core.
Question: What solution would allow reading from and writing to a dynamic variable residing in the global environment from within a worker (core) when executing a function in parallel? Is there a better approach?
Answer
The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that the Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see whether the value has been calculated (redisGet) and if not, do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be R scripts, so it's easy to expand the work force. It's a very nice alternative parallel paradigm. Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (it sleeps for a second):
fun <- function(x) { Sys.sleep(1); x }
We write a 'memoizer' that returns a variant of fun that first checks whether the value for x has already been calculated and, if so, uses that:
memoize <-
    function(FUN)
{
    force(FUN) # circumvent lazy evaluation
    require(rredis)
    redisConnect()
    function(x)
    {
        key <- as.character(x)
        val <- redisGet(key)
        if (is.null(val)) {
            val <- FUN(x)
            redisSet(key, val)
        }
        val
    }
}
We then memoize our function:
funmem <- memoize(fun)
And away we go:
> system.time(res <- funmem(10)); res
user system elapsed
0.003 0.000 1.082
[1] 10
> system.time(res <- funmem(10)); res
user system elapsed
0.001 0.001 0.040
[1] 10
This does require a Redis server running outside R, but it is very easy to install; see the documentation that comes with the rredis package.
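For example, on a Debian-like system (an assumption; package and service names vary by platform) the server can be installed and started with:

```shell
sudo apt-get install redis-server   # install the server
redis-server --daemonize yes        # start it in the background
redis-cli ping                      # replies PONG once the server is up
```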
An in-R parallel version might be:
library(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
clusterEvalQ(cl, { require(rredis); redisConnect() })
tasks <- sample(1:5, 100, TRUE)
system.time(res <- parSapply(cl, tasks, funmem))
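If a Redis server is not an option, a simpler pattern (a sketch, not part of the answer above; `magic()` is again a hypothetical stand-in for the real function) is to let the master keep cont.values, farm out only the NA cells, and merge the results back afterwards:

```r
library(parallel)  # ships with R; parSapply mirrors the snow API

cont.values <- data.frame(cell = 1:6, value = c(NA, 2, NA, 4, NA, 6),
                          x = 1:6, y = 7:12)

# Hypothetical stand-in for the real slow per-cell computation
magic <- function(x, y) { Sys.sleep(0.1); x + y }

cl <- makeCluster(2)
clusterExport(cl, "magic")                  # workers need their own copy
todo <- which(is.na(cont.values$value))     # indices of the missing cells only
new.vals <- parSapply(cl, todo,
                      function(i, d) magic(d$x[i], d$y[i]),
                      d = cont.values)      # the frame is copied to each worker
cont.values$value[todo] <- new.vals         # the master merges results back
stopCluster(cl)
```

This avoids shared writable state at the cost of shipping the data frame to each worker, and unlike the Redis store it does not de-duplicate work across separate runs.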