Parallel *ply within functions

Question

I want to use the parallel functionality of the plyr package within functions.

I would have thought that the proper way to export objects created within the body of the function (df_2 in this example) is as follows:

# rm(list=ls())
library(plyr)
library(doParallel)

workers=makeCluster(2)
registerDoParallel(workers,core=2)

plyr_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)

  #export df_2 via .paropts  
  ddply(df_1,"type",.parallel=TRUE,.paropts=list(.export="df_2"),.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test()
stopCluster(workers)

However, this throws an error:

Error in e$fun(obj, substitute(ex), parent.frame(), e$data) : 
  unable to find variable "df_2"

So I did some research and found that it works if I export df_2 manually:

workers=makeCluster(2)
registerDoParallel(workers,core=2)

plyr_test_2=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)

  #manually export df_2
  clusterExport(cl=workers,varlist=list("df_2"),envir=environment())

  ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test_2()
stopCluster(workers)

It gives the correct result:

  type x.x x.y
1    a   1   3
2    b   2   4

But I have also found that the following code works:

workers=makeCluster(2)
registerDoParallel(workers,core=2)

plyr_test_3=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)

  #no export at all!
  ddply(df_1,"type",.parallel=TRUE,.fun=function(y) {
    merge(y,df_2,all=FALSE,by="type")
  })
}
plyr_test_3()
stopCluster(workers)

plyr_test_3() also gives the correct result, and I don't understand why. I would have thought that I had to export df_2...

My question is: what is the right way to deal with parallel *ply within functions? Obviously, plyr_test() is incorrect. I have the feeling that the manual export in plyr_test_2() is useless, but I also think that plyr_test_3() is bad coding style. Could someone please elaborate on that? Thanks!

Recommended answer

The problem with plyr_test is that df_2 is defined inside plyr_test, where the doParallel package cannot access it, so the attempt to export df_2 fails. This is a scoping issue. plyr_test_2 avoids the problem because it doesn't use the .export option, but as you guessed, the call to clusterExport is not needed.

The reason that both plyr_test_2 and plyr_test_3 succeed is that df_2 is serialized along with the anonymous function that is passed to ddply via the .fun argument. In fact, both df_1 and df_2 are serialized along with the anonymous function, because that function is defined inside plyr_test_2 and plyr_test_3 and therefore carries their local environment with it. It's helpful that df_2 is included in this case, but the inclusion of df_1 is unnecessary and may hurt your performance.
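The capture described above can be observed without any cluster. The following is a minimal sketch in base R, assuming a hypothetical helper make_fun() that stands in for plyr_test_3; it is not taken from the original answer. It lists what the closure's environment holds and then round-trips the function through serialize(), which is roughly what happens when a function is shipped to a worker.

```r
# Hypothetical helper standing in for plyr_test_3: it returns the
# anonymous function instead of passing it to ddply.
make_fun <- function() {
  df_1 <- data.frame(type = c("a", "b"), x = 1:2)
  df_2 <- data.frame(type = c("a", "b"), x = 3:4)
  function(y) merge(y, df_2, all = FALSE, by = "type")
}

f <- make_fun()

# The closure's environment is make_fun's local environment,
# so it holds both data frames, not only the one the body uses.
captured <- ls(environment(f))

# Serializing the function also serializes that environment, so the
# restored copy still finds df_2 without any explicit export.
g <- unserialize(serialize(f, NULL))
result <- g(data.frame(type = "a", x = 1L))
```

Here captured contains both "df_1" and "df_2", which is why the unused df_1 travels to the workers as well.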

As long as df_2 is captured in the environment of the anonymous function, no other value of df_2 will ever be used, regardless of what you export. Unless you can prevent it from being captured, it is pointless to export it with either .export or clusterExport, because the captured value will be used. You can only get yourself into trouble (as you did with .export) by trying to export it to the workers.

Note that in this case, foreach does not auto-export df_2 because it isn't able to analyze the body of the anonymous function to see which symbols are referenced. If you call foreach directly without using an anonymous function, it will see the reference and auto-export df_2, making it unnecessary to export it explicitly with .export.
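A direct foreach call of the kind mentioned above might look like the following sketch. It assumes the foreach and doParallel packages are installed; the function name foreach_test and the split() step are illustrative choices, not part of the original answer. Because df_2 is referenced in the loop body itself, foreach can see the symbol and export it from the calling function's environment.

```r
library(foreach)
library(doParallel)

workers <- makeCluster(2)
registerDoParallel(workers)

# Hypothetical counterpart to plyr_test that calls foreach directly.
foreach_test <- function() {
  df_1 <- data.frame(type = c("a", "b"), x = 1:2)
  df_2 <- data.frame(type = c("a", "b"), x = 3:4)

  # One piece of df_1 per value of type, mirroring ddply(df_1, "type", ...).
  pieces <- split(df_1, df_1$type)

  # df_2 appears in the loop body, so no .export argument is given here.
  foreach(piece = pieces, .combine = rbind) %dopar% {
    merge(piece, df_2, all = FALSE, by = "type")
  }
}

res <- foreach_test()
stopCluster(workers)
```

The loop body is an expression rather than a closure, so there is no enclosing function environment to drag along by accident.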

You could prevent the environment of plyr_test from being serialized along with the anonymous function by changing the function's environment before passing it to ddply:

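plyr_test=function() {
  df_1=data.frame(type=c("a","b"),x=1:2)
  df_2=data.frame(type=c("a","b"),x=3:4)
  # export df_2 from this function's environment to the workers
  clusterExport(cl=workers,varlist="df_2",envir=environment())
  fun=function(y) merge(y, df_2, all=FALSE, by="type")
  # detach fun from plyr_test's environment so df_1 and df_2 are not
  # serialized with it; on the workers, df_2 resolves to the exported copy
  environment(fun)=globalenv()
  ddply(df_1,"type",.parallel=TRUE,.fun=fun)
}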
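To see what resetting the environment buys, one can compare the serialized size of a closure before and after detaching it. This is a rough base R sketch with hypothetical names (serialized_size, big); serialize() is used here only as a stand-in for the transfer to a worker.

```r
# Hypothetical helper: builds a closure next to a large local object and
# reports how many bytes the closure takes when serialized.
serialized_size <- function(detach) {
  big <- numeric(1e6)        # large local object the closure never uses
  fun <- function(y) y + 1
  if (detach) {
    # Point the closure at the global environment, as in plyr_test above.
    environment(fun) <- globalenv()
  }
  length(serialize(fun, NULL))
}

captured_size <- serialized_size(detach = FALSE)  # environment travels along
detached_size <- serialized_size(detach = TRUE)   # global env is not copied
```

The global environment is written as a reference rather than by value, so the detached closure no longer carries big with it.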

One of the advantages of the foreach package is that it doesn't encourage you to create a function inside another function, where it might accidentally capture a bunch of variables.

This issue suggests to me that foreach should include an option called .exportenv, similar to the envir option of clusterExport. That would be very helpful for plyr, since it would allow df_2 to be exported correctly using .export. However, that exported value still wouldn't be used unless the environment containing df_2 was removed from the .fun function.
