parLapply from inside function copies data to nodes unexpectedly


Question

I have a large list (~30 GB) and the following functions:

library(parallel)  # provides makeCluster, clusterExport and parLapply

cl <- makeCluster(24, outfile = "")

Foo1 <- function(cl, largeList) {
  return(parLapply(cl, largeList, Bar))
}

Bar1 <- function(listElement) {
  return(nrow(listElement))
}

Foo2 <- function(cl, largeList, arg) {
  clusterExport(cl, list("arg"), envir = environment())
  return(parLapply(cl, largeList, function(x) Bar(x, arg)))
}

Bar2 <- function(listElement, arg) {
  return(nrow(listElement))
}

There are no issues with:

Foo1(cl, largeList)

Watching the memory usage of each process, I can see that only one list element is copied to each node.

However, when calling:

Foo2(cl, largeList, 0)

a copy of largeList is sent to each node. Stepping through Foo2, the copying of largeList does not happen at clusterExport, but at parLapply. Also, when I execute the body of Foo2 from the global environment (not within a function), there are no issues. What is causing this?

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 21 (Twenty One)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
[1] xts_0.9-7           zoo_1.7-12          snow_0.3-13        
[4] Rcpp_0.12.2         randomForest_4.6-12 gbm_2.1.1          
[7] lattice_0.20-33     survival_2.38-3     e1071_1.6-7        

loaded via a namespace (and not attached):
[1] class_7.3-13 tools_3.2.2  grid_3.2.2 

Recommended answer

The problem is that the worker function, which is the third argument to parLapply, is serialized and sent to each of the workers along with the input data. If the worker function is defined inside another function, such as Foo2, then that function's local environment is serialized along with it. Since largeList is an argument to Foo2, it lives in that local environment and is therefore serialized together with the worker function.
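The effect described above can be observed without a cluster by comparing the serialized size of two functions. This is only a minimal sketch: `big`, `make_local_worker` and `global_worker` are hypothetical names introduced for illustration, and the list is a few megabytes rather than 30 GB.

```r
# Hypothetical stand-in for largeList (about 3 MB of doubles).
big <- replicate(10, matrix(0, 200, 200), simplify = FALSE)

make_local_worker <- function(largeList, arg) {
  # Force the argument so its value is stored in this call frame,
  # as it would be once parLapply has evaluated largeList.
  force(largeList)
  # The returned closure's enclosing environment is this call frame,
  # which holds both largeList and arg.
  function(x) nrow(x) + arg
}

# Defined at top level: its enclosing environment is the global environment.
global_worker <- function(x, arg) nrow(x) + arg

local_worker <- make_local_worker(big, 0)

# Number of bytes that would have to be shipped for each function.
size_local  <- length(serialize(local_worker, NULL))
size_global <- length(serialize(global_worker, NULL))

cat("closure defined inside a function:", size_local, "bytes\n")
cat("function defined at top level:    ", size_global, "bytes\n")
```

Under these assumptions the first size includes the whole list, while the second covers only the function itself.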

You didn't have a problem with Foo1 because Bar was presumably created in the global environment, and the global environment is never serialized along with functions.

In other words, it's a good idea to always define the worker function in the global environment or in a package when using parLapply, clusterApply, clusterApplyLB, etc. Of course, if you call parLapply from the global environment, the anonymous function is itself defined in the global environment, which is why running the body of Foo2 at top level caused no issues.
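One way to apply this advice to Foo2 is sketched below. It assumes the extra argument can be forwarded through the `...` argument of parLapply instead of being captured by an anonymous function; the two-worker cluster and `smallList` are hypothetical stand-ins so the example runs quickly.

```r
library(parallel)

# Worker defined at top level: its enclosing environment is the global
# environment, which is not serialized along with the function.
Bar2 <- function(listElement, arg) {
  nrow(listElement)
}

Foo2 <- function(cl, largeList, arg) {
  # Arguments after the function are forwarded to every call of Bar2,
  # so no closure over Foo2's call frame (and no clusterExport) is needed.
  parLapply(cl, largeList, Bar2, arg)
}

cl <- makeCluster(2)                                   # small cluster for the sketch
smallList <- list(matrix(0, 3, 2), matrix(0, 5, 2))    # hypothetical stand-in data
res <- Foo2(cl, smallList, 0)
stopCluster(cl)

print(res)
```

If an anonymous function is still preferred, another option is to reset its enclosure with `environment(f) <- globalenv()` before passing it to parLapply, in which case any values it needs must be exported to the workers separately.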
