How to use `foreach` and `%dopar%` with an `R6` class in R?


Problem description

I ran into an issue trying to use %dopar% and foreach() together with an R6 class. Searching around, I could only find two resources related to this, an unanswered SO question and an open GitHub issue on the R6 repository.

In one comment (on the GitHub issue) a workaround is suggested: reassign the parent_env of the class with SomeClass$parent_env <- environment(). I would like to understand what exactly environment() refers to when this expression (i.e., SomeClass$parent_env <- environment()) is called within the %dopar% of foreach.

Here is a minimal reproducible example:

Work <- R6::R6Class("Work",

    public = list(
        values = NULL,


        initialize = function() {
            self$values <- "some values"
        }
    )
)

Now, the following Task class uses the Work class in the constructor.

Task <- R6::R6Class("Task",
    private = list(
        ..work = NULL
    ),


    public = list(
        initialize = function(time) {
            private$..work <- Work$new()
            Sys.sleep(time)
        }
    ),


    active = list(
        work = function() {
            return(private$..work)
        }
    )
)

In the Factory class, the Task class is created and the foreach is implemented in ..m.thread().

Factory <- R6::R6Class("Factory",

    private = list(
        ..warehouse = list(),
        ..amount = NULL,
        ..parallel = NULL,


        ..m.thread = function(object, ...) {
            cluster <- parallel::makeCluster(parallel::detectCores() -  1)
            doParallel::registerDoParallel(cluster)

            private$..warehouse <- foreach::foreach(1:private$..amount, .export = c("Work")) %dopar% {
                # What exactly does `environment()` encapsulate in this context?
                object$parent_env <- environment()
                object$new(...) 
            }

            parallel::stopCluster(cluster)
        },


        ..s.thread = function(object, ...) {
            for (i in 1:private$..amount) {
               private$..warehouse[[i]] <- object$new(...)
            }
        },


        ..run = function(object, ...) {
            if(private$..parallel) {
                private$..m.thread(object, ...)
            } else {
                private$..s.thread(object, ...)
            }
        }
    ),


    public = list(
        initialize = function(object, ..., amount = 10, parallel = FALSE) {
            private$..amount <- amount
            private$..parallel <- parallel

            private$..run(object, ...)
        }
    ),


    active = list(
        warehouse = function() {
            return(private$..warehouse)
        }
    )
)

It is then called as follows:

library(foreach)

x = Factory$new(Task, time = 2, amount = 10, parallel = TRUE)

Without the following line object$parent_env <- environment(), it throws an error (i.e., as mentioned in the other two links): Error in { : task 1 failed - "object 'Work' not found".

I would like to know, (1) what are some potential pitfalls when assigning the parent_env inside foreach and (2) why does it work in the first place?

Update 1:

  • I returned environment() from within foreach(), so that private$..warehouse captures those environments.
  • Using rlang::env_print() in a debug session (the browser() statement was placed right after foreach finished executing), here is what they consist of:
Browse[1]> env_print(private$..warehouse[[1]])

# <environment: 000000001A8332F0>
# parent: <environment: global>
# bindings:
#  * Work: <S3: R6ClassGenerator>
#  * ...: <...>

Browse[1]> env_print(environment())

# <environment: 000000001AC0F890>
# parent: <environment: 000000001AC20AF0>
# bindings:
#  * private: <env>
#  * cluster: <S3: SOCKcluster>
#  * ...: <...>

Browse[1]> env_print(parent.env(environment()))

# <environment: 000000001AC20AF0>
# parent: <environment: global>
# bindings:
#  * private: <env>
#  * self: <S3: Factory>

Browse[1]> env_print(parent.env(parent.env(environment())))

# <environment: global>
# parent: <environment: package:rlang>
# bindings:
#  * Work: <S3: R6ClassGenerator>
#  * .Random.seed: <int>
#  * Factory: <S3: R6ClassGenerator>
#  * Task: <S3: R6ClassGenerator>

Recommended answer

Disclaimer: a lot of what I say here is educated guesswork and inference based on what I know; I can't guarantee everything is 100% correct.

I think there can be many pitfalls, and which one applies really depends on what you do. I think your second question is more important, because if you understand that, you'll be able to evaluate some of the pitfalls by yourself.

The topic is rather complex, but you can probably start by reading about R's lexical scoping. In essence, R has a sort of hierarchy of environments, and when R code is executed, variables whose values are not found in the current environment (which is what environment() returns) are sought in the parent environments (not to be confused with the caller environments).
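To make the parent-versus-caller distinction concrete, here is a minimal sketch in base R. The names x, f, g, h and result are made up for illustration and do not come from the question.

```r
x <- "defined next to f"

f <- function() {
    # `x` is not bound in f's own environment, so R continues the search in
    # the environment where f was *defined* (its parent/enclosing environment).
    x
}

g <- function() {
    # This `x` lives in g's environment. g is f's *caller*, not its parent,
    # so lexical scoping never looks here when f runs.
    x <- "defined in the caller"
    f()
}

result <- g()

# environment() called inside a function returns that call's own environment;
# its parent is the environment the function was defined in.
h <- function() environment()
call_env <- h()
same_parent <- identical(parent.env(call_env), environment(h))
```

Here result holds the value defined next to f rather than the one assigned inside g, and same_parent is TRUE.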

Based on the GitHub issue you linked, R6 generators save a "reference" to their parent environments, and they expect that everything their classes may need can be found in said parent or somewhere along the environment hierarchy, starting at that parent and going "up".

The workaround you're using works because you're replacing the generator's parent environment with the environment of the current foreach call inside the parallel worker (which may be a different R process, not necessarily a different thread). Given that your .export specification probably exports the necessary values, R's lexical scoping can then search for missing values starting from the foreach call in the separate thread/process.
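The same mechanism can be seen without any parallelism. The sketch below assumes the R6 package is installed; make_generator, helper, Demo and other_env are hypothetical names used only for this illustration. It relies on R6Class() recording its parent_env (by default the environment it was called from) and on methods of new objects looking up free variables through it.

```r
library(R6)

make_generator <- function() {
    helper <- "found via the original parent_env"
    # parent_env defaults to the calling environment, i.e. this function call.
    R6::R6Class("Demo",
        public = list(
            get = function() helper   # `helper` is a free variable in the method
        )
    )
}

Demo <- make_generator()
first <- Demo$new()$get()

# Point the generator at a different environment, as the workaround does with
# environment() inside the foreach body.
other_env <- new.env()
assign("helper", "found via the replaced parent_env", envir = other_env)
Demo$parent_env <- other_env
second <- Demo$new()$get()
```

Objects created after the reassignment resolve helper through other_env, which is presumably what happens with Work once object$parent_env points at an environment that .export has populated on the worker.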

For the specific example you linked, I found that a simpler way to make it work (at least on my Linux machine) is to do the following:

library(doParallel)

cluster <- parallel::makeCluster(parallel::detectCores() -  1)
doParallel::registerDoParallel(cluster)
parallel::clusterExport(cluster, setdiff(ls(), "cluster"))

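x = Factory$new(Task, time = 1, amount = 3, parallel = TRUE)

but keep the ..m.thread function as:

..m.thread = function(object, ...) {
    # The cluster is created, registered and stopped outside the class.
    private$..warehouse <- foreach::foreach(1:private$..amount) %dopar% {
        object$new(...) 
    }
}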

(and manually call stopCluster when done).

The clusterExport call should have semantics similar to*: take everything from the main R process's global environment except cluster, and make it available in each parallel worker's global environment. That way, any code inside the foreach call can use the generators when lexical scoping reaches the workers' respective global environments. foreach can be clever and export some variables automatically (as shown in the GitHub issue), but it has limitations, and the hierarchy used during lexical scoping can get very messy.
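A small way to check this behaviour, using only the parallel package that ships with R, is sketched below. The variable demo_greeting and the two-worker cluster are arbitrary choices for the example; envir = environment() is passed so the sketch also works when it is not run from the global environment.

```r
library(parallel)

cluster <- makeCluster(2)
demo_greeting <- "hello from the main process"

# Each worker is a fresh R process, so its global environment does not know
# about objects created in the main process.
before <- unlist(clusterEvalQ(cluster, exists("demo_greeting")))

# Copy the object into every worker's global environment.
clusterExport(cluster, "demo_greeting", envir = environment())
after <- unlist(clusterEvalQ(cluster, exists("demo_greeting")))

stopCluster(cluster)
```

before should be FALSE for both workers and after TRUE, which is the same effect the clusterExport call above has for Work, Task and Factory.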

*I say "similar to" because I don't know what exactly R does to distinguish (global) environments if forks are used, but since that export is needed, I assume they are indeed independent of each other.

PS: If you create workers inside a function call, I'd use a call to on.exit(parallel::stopCluster(cluster)); that way, if an error occurs, you avoid leaving worker processes around until they are somehow stopped.
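As a rough sketch of that pattern (run_on_cluster and n_workers are hypothetical names, and parLapply stands in for whatever work the function really does):

```r
library(parallel)

run_on_cluster <- function(n_workers = 2) {
    cluster <- makeCluster(n_workers)
    # Registered right after creation, so the workers are shut down whether
    # the function returns normally or exits because of an error.
    on.exit(stopCluster(cluster), add = TRUE)

    parLapply(cluster, 1:4, function(i) i^2)
}

squares <- run_on_cluster()
```

In the Factory class from the question, the equivalent would be placing the on.exit() call directly after makeCluster() inside ..m.thread.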
