Reading global variables using foreach in R


Problem Description



I am trying to run a foreach loop on a windows server with a 16 core CPU and 64 GB of RAM using RStudio. (using the doParallel package)

The "worker" processes copy over all the variables from outside the for loop (observed by watching the instantiation of these processes in windows task manager when the foreach loop is run), thus bloating up the memory used by each process. I tried to declare some of the especially large variables as global, while ensuring that these variables were also read from, and not written to, inside the foreach loop to avoid conflicts. However, the processes still quickly use up all available memory.

Is there a mechanism to ensure that the "worker" processes do not create copies of some of the "read-only" variables? Such as a specific way to declare such variables?
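
A minimal sketch of the kind of setup described above (the object name and sizes are invented for illustration): a large object defined outside the loop is referenced inside it, so foreach ships a copy to every worker.

library(doParallel)

# On Windows, doParallel starts PSOCK ("snow") workers: separate R processes
# that share nothing with the master and must receive their data explicitly.
cl <- makeCluster(16)
registerDoParallel(cl)

big_data <- matrix(rnorm(1e7), ncol = 1000)  # ~80 MB, defined outside the loop

# Because "big_data" is referenced in the loop body, foreach auto-exports a
# full copy of it to each of the 16 workers, multiplying the memory use.
res <- foreach(i = 1:1000, .combine = 'c') %dopar% {
    mean(big_data[, i])
}

stopCluster(cl)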

Solution

The doParallel package will auto-export variables to the workers that are referenced in the foreach loop. If you don't want it to do that, you can use the foreach ".noexport" option to prevent it from auto-exporting particular variables. But if I understand you correctly, your problem is that R is subsequently duplicating some of those variables, which is even more of a problem than usual since it is happening in multiple processes on a single machine.
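
For example (a sketch; "big_lookup" is an invented name), ".noexport" takes a character vector of variable names to keep off the auto-export list. One common pattern is to ship the object to the workers once with clusterExport and then tell foreach not to re-export it on every loop:

library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

big_lookup <- rnorm(1e7)   # invented example object

# Ship the object to the workers once, up front, instead of letting foreach
# re-serialize it for this loop; then suppress the automatic export.
clusterExport(cl, 'big_lookup')

res <- foreach(i = 1:10, .combine = 'c', .noexport = 'big_lookup') %dopar% {
    mean(big_lookup) + i   # resolved from the copy already on each worker
}

stopCluster(cl)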

There isn't a way to declare a variable so that R will never make a duplicate of it. You either need to replace the problem variables with objects from a package like bigmemory so that copies are never made, or you can try modifying the code in such a way as to not trigger the duplication. You can use the tracemem function to help you, since it will print a message whenever that object is duplicated.
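
As a small illustration, tracemem (in base R) prints a message each time the traced object is duplicated:

x <- matrix(1:100, 10)
tracemem(x)      # start tracing; returns the object's address

y <- x           # no message: this only creates a second binding, not a copy
y[1, 1] <- 0L    # prints "tracemem[... -> ...]": modifying y forces a duplicate

untracemem(x)    # stop tracing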

However, you may be able to avoid the problem by reducing the data that is needed by the workers. That reduces the amount of data that needs to be copied to each of the workers, as well as decreasing their memory footprint.

Here is a classic example of giving the workers more data than they need:

x <- matrix(1:100, 10)
foreach(i=1:10, .combine='c') %dopar% {
    mean(x[,i])
}

Since the matrix x is referenced in the foreach loop, it will be auto-exported to each of the workers, even though each worker only needs a subset of the columns. The simplest solution is to iterate over the actual columns of the matrix rather than over column indices:

foreach(xc=x, .combine='c') %dopar% {
    mean(xc)
}

Not only is less data transferred to the workers, but each of the workers only actually needs to have one column in memory at a time, which greatly decreases its memory footprint for large matrices. The xc vector may still end up being duplicated, but it doesn't hurt nearly as much because it is much smaller than x.

Note that this technique only helps when doParallel uses the "snow-derived" functions, such as parLapply and clusterApplyLB, not when using mclapply. Using this technique can make the loop a bit slower when mclapply is used, since all of the workers get the matrix x for free, so why transfer around the columns when the workers already have the entire matrix? However, on Windows, doParallel can't use mclapply, so this technique is very important.
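
Which of those code paths you get depends on how the backend is registered; roughly (a sketch, not a complete list of doParallel's options):

library(doParallel)

# Registering an explicit cluster uses the snow-derived functions; this is
# the only kind of backend available on Windows.
cl <- makeCluster(4)
registerDoParallel(cl)
# ... run foreach loops ...
stopCluster(cl)

# Registering a core count uses mclapply (forked workers that already share
# the master's memory) on Unix-alikes, and falls back to a cluster on Windows.
registerDoParallel(cores = 4)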

The important thing is to think about what data is really needed by the workers in order to perform their work and to try to decrease it if possible. Sometimes you can do that by using special iterators, either from the iterators or itertools packages, but you may also be able to do that by changing your algorithm.
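
For instance (a sketch building on the example above; a parallel backend is assumed to be registered), itertools::isplitCols sends each task a block of columns rather than the whole matrix:

library(foreach)
library(itertools)

x <- matrix(1:100, 10)

# Split x into 5 column blocks; each task receives only its own block.
foreach(xblock = isplitCols(x, chunks = 5), .combine = 'c') %dopar% {
    colMeans(xblock)
}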
