Writing to global environment when running in parallel


Problem description

I have a data.frame of cells, values and coordinates. It resides in the global environment.

> head(cont.values)
   cell value   x   y
1 11117    NA -34 322
2 11118    NA -30 322
3 11119    NA -26 322
4 11120    NA -22 322
5 11121    NA -18 322
6 11122    NA -14 322

Because my custom function takes almost a second to calculate an individual cell (and I have tens of thousands of cells to calculate), I don't want to duplicate calculations for cells that already have a value. My solution below tries to avoid that. Each cell can be calculated independently, screaming for parallel execution.

What my function actually does is check whether there is already a value for a specified cell number; if it is NA, the function calculates the value and inserts it in place of the NA.

I can run my magic function (the result is the value for a corresponding cell) using the apply family of functions, and from within apply I can read and write cont.values without a problem (it's in the global environment).
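Sketched in base R, the sequential version of this workflow might look like the following (magic_fun is a hypothetical stand-in for the slow per-cell function, and the small data frame stands in for the real cont.values):

```r
# Stand-in for cont.values; in the real problem this sits in the global
# environment and has tens of thousands of rows.
cont.values <- data.frame(cell  = 11117:11122,
                          value = c(NA, 2, NA, 4, NA, 6),
                          x     = seq(-34, -14, by = 4),
                          y     = 322)

# Hypothetical stand-in for the slow per-cell calculation
magic_fun <- function(cell) cell %% 100

# Fill in only the cells whose value is still NA, leaving existing
# values untouched
todo <- is.na(cont.values$value)
cont.values$value[todo] <- sapply(cont.values$cell[todo], magic_fun)
```

Because sapply only sees the NA rows, cells that already carry a value are never recomputed; the difficulty described below is doing the same thing when the workers run in separate processes.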

Now, I want to run this in parallel (using snowfall), but I'm unable to read from or write to this variable from the individual workers.

Question: What solution would allow reading from and writing to a dynamic variable residing in the global environment from within a worker (core) while executing a function in parallel? Is there a better approach?

Recommended answer

The pattern of a central store that workers consult for values is implemented in the rredis package on CRAN. The idea is that a Redis server maintains a store of key-value pairs (your global data frame, re-implemented). Workers query the server to see whether the value has already been calculated (redisGet) and, if not, do the calculation and store it (redisSet) so that other workers can re-use it. Workers can be R scripts, so it's easy to expand the work force. It's a very nice alternative parallel paradigm.

Here's an example that uses the notion of 'memoizing' each result. We have a function that is slow (it sleeps for a second):

fun <- function(x) { Sys.sleep(1); x }

We write a 'memoizer' that returns a variant of fun which first checks whether the value for x has already been calculated and, if so, uses that:

memoize <-
    function(FUN)
{
    force(FUN) # circumvent lazy evaluation
    require(rredis)
    redisConnect()
    function(x)
    {
        key <- as.character(x)
        val <- redisGet(key)       # NULL if not yet in the store
        if (is.null(val)) {
            val <- FUN(x)
            redisSet(key, val)     # cache for other workers to re-use
        }
        val
    }
}

We then memoize our function:

funmem <- memoize(fun)

> system.time(res <- funmem(10)); res
   user  system elapsed 
  0.003   0.000   1.082 
[1] 10
> system.time(res <- funmem(10)); res
   user  system elapsed 
  0.001   0.001   0.040 
[1] 10

This does require a Redis server running outside of R, but it is very easy to install; see the documentation that comes with the rredis package.

An in-R parallel version might be:

library(snow)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
clusterEvalQ(cl, { require(rredis); redisConnect() })
tasks <- sample(1:5, 100, TRUE)
system.time(res <- parSapply(cl, tasks, funmem))
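When the run is finished, it is good practice to release the resources the snippet above acquired; a minimal cleanup sketch, assuming the cl cluster object and the master's Redis connection from above:

```r
# Close each worker's Redis connection, then the master's,
# and shut the snow cluster down.
clusterEvalQ(cl, redisClose())
redisClose()
stopCluster(cl)
```

Because every worker connects to the same Redis store, repeated tasks in `tasks` are computed at most once across the whole cluster; the remaining lookups are cheap redisGet calls.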
