How to use the Reduce() function in R parallel computing?

Question

I want to run Reduce on out1, a list of 66000 elements:

trialStep1_done <- Reduce(rbind, out1)

However, it takes too long to run. I wonder whether I can run this code with the help of a parallel computing package.

I know there are mclapply and mcMap, but I don't see any function like mcReduce in the parallel package.

Is there a function like mcReduce that runs Reduce in parallel in R, so that I can complete this task?

Thanks a lot @BrodieG and @ZheYuanLi, your answers are very helpful. I think the following code example represents my question more precisely:

library(magrittr)  # provides the %>% pipe used below
df1 <- data.frame(a=letters, b=LETTERS, c=1:26 %>% as.character())
set.seed(123)
df2 <- data.frame(a=letters %>% sample(), b=LETTERS %>% sample(), c=1:26 %>% sample() %>% as.character())
set.seed(1234)
df3 <- data.frame(a=letters %>% sample(), b=LETTERS %>% sample(), c=1:26 %>% sample() %>% as.character())
out1 <- list(df1, df2, df3)

# I don't know how to rbind() the list elements using only matrix(),
# so I have to use lapply() and then Reduce() or do.call()
out2 <- lapply(out1, function(x) matrix(unlist(x), ncol = length(x), byrow = FALSE))

Reduce(rbind, out2)
do.call(rbind, out2)
# One thing is certain: do.call() is much faster than Reduce(); @BrodieG's answer helped me understand why.

So, at this point, for my 200000-row dataset, do.call() solves the problem very well.

Finally, I wonder whether there is an even faster way, or whether the approach @ZheYuanLi demonstrated with just matrix() would be possible here?
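For the small example above, a matrix()-only formulation can be sketched as follows. This is an illustration, not code from either answerer: `make_df` is a hypothetical helper that builds stand-in data similar to `df1` to `df3`, and the transpose step is one assumed way to make `unlist()` emit the values row by row. No claim is made here about its speed relative to `do.call()`.

```r
# Hypothetical helper: a 26-row data frame of shuffled strings
make_df <- function(seed) {
  set.seed(seed)
  data.frame(a = sample(letters), b = sample(LETTERS),
             c = as.character(sample(1:26)), stringsAsFactors = FALSE)
}
out1 <- lapply(1:3, make_df)

# Per-element conversion to character matrices, as in the question
out2 <- lapply(out1, function(x) matrix(unlist(x), ncol = length(x), byrow = FALSE))

# Single rbind() call over all matrices
by_rbind <- do.call(rbind, out2)

# matrix()-only variant: matrices are stored column by column, so each one is
# transposed first; unlist() then yields row 1, row 2, ... across all elements
by_matrix <- matrix(unlist(lapply(out2, t)), ncol = 3, byrow = TRUE)

identical(by_rbind, by_matrix)
```

The transpose is needed only because the elements here are multi-row matrices; for a list of plain vectors, `matrix(unlist(x), byrow = TRUE)` works directly.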

Recommended answer

The problem is not rbind, the problem is Reduce. Unfortunately, function calls in R are expensive, and particularly so when you keep creating new objects. In this case, you call rbind 65999 times, and each call creates a new R object with one more row than the last. Instead, you can call rbind just once with 66000 arguments. That is much faster because rbind does the binding internally in C, so there is one R function call instead of tens of thousands, and the memory for the result is allocated just once. Here we compare your Reduce use with Zheyuan's matrix/unlist approach and finally with rbind called once through do.call (do.call lets you call a function with all of its arguments supplied as a list):

out1 <- replicate(1000, 1:20, simplify=FALSE)  # use 1000 elements for illustrative purposes

library(microbenchmark)    
microbenchmark(times=10,
  a <- do.call(rbind, out1),
  b <- matrix(unlist(out1), ncol=20, byrow=TRUE),
  c <- Reduce(rbind, out1)
)
# Unit: microseconds
#                                                expr        min         lq
#                           a <- do.call(rbind, out1)    469.873    479.815
#  b <- matrix(unlist(out1), ncol = 20, byrow = TRUE)    257.263    260.479
#                            c <- Reduce(rbind, out1) 110764.898 113976.376
all.equal(a, b, check.attributes=FALSE)
# [1] TRUE
all.equal(b, c, check.attributes=FALSE)
# [1] TRUE
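To make the cost argument concrete, the sketch below counts how many rows get written in total when n single-row pieces are stacked one at a time. It assumes each intermediate `rbind()` call builds a complete new matrix, as described above; the helper name `rows_written_reduce` is made up for this illustration.

```r
# Hypothetical helper: total rows written by Reduce(rbind, ...) over n pieces.
# Step k produces a fresh k-row matrix, so the total is 2 + 3 + ... + n.
rows_written_reduce <- function(n) {
  n <- as.numeric(n)  # use doubles: the total exceeds the integer range for large n
  n * (n + 1) / 2 - 1
}

# Check the formula against an actual Reduce() run on a small input
pieces <- replicate(50, 1:20, simplify = FALSE)
steps <- Reduce(rbind, pieces, accumulate = TRUE)  # keeps every intermediate result
observed <- sum(vapply(steps[-1], nrow, integer(1)))
observed == rows_written_reduce(50)

# For the 66000-element list: roughly 2.18e9 rows written,
# versus 66000 rows when rbind() is called once
rows_written_reduce(66000)
```

The quadratic growth in rows written is one way to see why the gap between `Reduce()` and a single `rbind()` call widens as the list gets longer.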

In the benchmark above, Zheyuan's matrix/unlist approach is the fastest, but for all intents and purposes the do.call(rbind, out1) method is pretty similar.
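As for the original question: the parallel package has no mcReduce. Because row binding can be done in independent chunks, a parallel version can still be sketched with mclapply. The helper `chunked_rbind` and its parameters `n_chunks` and `cores` are hypothetical names introduced for this illustration. No timing is claimed; for data of this size, the overhead of forking workers and copying results back may well outweigh any gain over a single do.call.

```r
library(parallel)

# Hypothetical helper (not part of the parallel package): split the list into
# contiguous chunks, bind each chunk in a worker, then bind the partial results.
# n_chunks must be at least 2 because of how cut() is used here.
chunked_rbind <- function(pieces, n_chunks = 4L, cores = 2L) {
  # Forking is unavailable on Windows, where mclapply() needs mc.cores = 1
  if (.Platform$OS.type == "windows") cores <- 1L
  # Contiguous chunk ids preserve the original row order
  ids <- cut(seq_along(pieces), breaks = n_chunks, labels = FALSE)
  chunks <- split(pieces, ids)
  partial <- mclapply(chunks, function(ch) do.call(rbind, ch), mc.cores = cores)
  do.call(rbind, unname(partial))
}

out1 <- replicate(1000, 1:20, simplify = FALSE)
res <- chunked_rbind(out1)
identical(res, do.call(rbind, out1))
```

Each worker still uses a single `rbind()` call per chunk, so the sketch keeps the main advice of the answer and only adds parallelism on top of it.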
