multidplyr:试用自定义功能 [英] multidplyr: trial custom function

查看:64
本文介绍了multidplyr:试用自定义功能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试通过 multidplyr :: do()运行自定义函数代码> 在群集上.考虑这个简单的自包含示例.例如,我正在尝试将自定义函数 myWxTest 应用于 flight 数据集中的每个 common_dest (飞行次数超过50个的目的地):

I'm trying to learn to run a custom function through multidplyr::do() on a cluster. Consider this simple self contained example. For example's sake, I'm trying to apply my custom function myWxTest to each common_dest (destinations with more than 50 flights) in the flight dataset:

library(dplyr)
library(multidplyr)
library(nycflights13)
library(quantreg)

myWxTest <- function(x){
    stopifnot(!is.null(x$dep_time))
    stopifnot(!is.null(x$dep_delay))
    stopifnot(!is.null(x$sched_dep_time))
    stopifnot(!is.null(x$sched_arr_time))
    stopifnot(!is.null(x$arr_time))

    out_mat <- c('(Intercept)' = NA, dep_time = NA, dep_delay = NA, sched_dep_time = NA, sched_arr_time = NA)
    if(length(x$arr_time)>5){
        model_1 <- quantreg::rq(arr_time ~ dep_time + dep_delay + sched_dep_time + sched_arr_time, data = x, tau = .5)
        out_mat[names(coef(model_1))] <- coef(model_1)
    }
    return(out_mat)
}

common_dest <- flights %>%
  count(dest) %>%
  filter(n >= 365) %>%
  semi_join(flights, .) %>% 
  mutate(yday = lubridate::yday(ISOdate(year, month, day)))


cluster <- create_cluster(2)
set_default_cluster(cluster)
by_dest <- common_dest %>% 
           partition(dest, cluster = cluster)
cluster_library(by_dest, "quantreg")

到目前为止,一切都很好(但我只是复制了插图中的示例).现在,我必须将自定义函数发送到每个节点:

So far so good (but I'm just reproducing the examples from the vignette). Now, I have to send my custom function to each node:

cluster %>% cluster_call(myWxTest)

但是我得到:

Error in checkForRemoteErrors(lapply(cl, recvResult)) : 
  2 nodes produced errors; first error: argument "x" is missing, with no default

最终,我想将 myWxTest 应用于每个子组:

eventually, I want to apply myWxTest to each subgroup:

models <- by_dest %>% 
          do(myWxTest(.))

推荐答案

我通过一些调整使它运行:

I got it running with a couple tweaks:

library(dplyr)
library(multidplyr)
library(nycflights13)
library(quantreg)

myWxTest <- function(x){
    stopifnot(!is.null(x$dep_time))
    stopifnot(!is.null(x$dep_delay))
    stopifnot(!is.null(x$sched_dep_time))
    stopifnot(!is.null(x$sched_arr_time))
    stopifnot(!is.null(x$arr_time))

    out_mat <- c('(Intercept)' = NA, dep_time = NA, dep_delay = NA, sched_dep_time = NA, sched_arr_time = NA)
    if(length(x$arr_time)>5){
        model_1 <- quantreg::rq(arr_time ~ dep_time + dep_delay + sched_dep_time + sched_arr_time, data = x, tau = .5)
        out_mat[names(coef(model_1))] <- coef(model_1)
    }
    return(as.data.frame(out_mat, stringsAsFactors = FALSE))    # change result to data.frame, not matrix
}

common_dest <- flights %>%
    count(dest) %>%
    filter(n >= 365) %>%
    semi_join(flights, .) %>% 
    mutate(yday = lubridate::yday(ISOdate(year, month, day)))

by_dest <- common_dest %>% partition(dest)

cluster_library(by_dest, "quantreg")
cluster_copy(by_dest, myWxTest)    # copy function to each node

models <- by_dest %>% do(myWxTest(.)) %>% collect()    # collect data from clusters

...返回本地data.frame:

...which returns a local data.frame:

models
#> Source: local data frame [390 x 2]
#> Groups: dest [78]
#> 
#>     dest     out_mat
#>    <chr>       <dbl>
#> 1    CAK 156.5248953
#> 2    CAK   0.9904261
#> 3    CAK  -0.0767928
#> 4    CAK  -0.3523211
#> 5    CAK   0.3220386
#> 6    DCA  74.5959035
#> 7    DCA   0.2751917
#> 8    DCA   1.0712483
#> 9    DCA   0.2874165
#> 10   DCA   0.4344960
#> # ... with 380 more rows

这篇关于multidplyr:试用自定义功能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆