Loading files in parallel not working with foreach + data.table


Problem description

I would like to use foreach in conjunction with data.table (v.1.8.7) to load files and bind them. foreach is not parallelizing, and it returns a warning...

write.table(matrix(rnorm(5e6), nrow = 5e5), "myFile.csv", quote = FALSE, sep = ",", row.names = FALSE, col.names = TRUE)
library(data.table)
# I use fread from data.table 1.8.7 (dev) for performance and usability
DT <- fread("myFile.csv")

Now suppose I have n of those files to load and row-bind, and I would like to parallelize it. (I am on Windows, so no forking.)

allFiles = rep("myFile.csv", 4) # you can change 4 to whatever


using lapply

f1 <- function(allFiles){
    DT <- lapply(allFiles, FUN=fread) # loads myFile.csv sequentially with fread, once per entry in allFiles
    DT <- rbindlist(DT);
    return(DT);
}

using parallel (part of R since 2.14.0)

library(parallel)
f2 <- function(allFiles){
    mc <- detectCores(); #how many cores?
    cl <- makeCluster(mc); #build the cluster
    DT <- parLapply(cl,allFiles,fun=fread); #call fread on each core (well... using each core at least)
    stopCluster(cl);
    DT <- rbindlist(DT);
    return(DT);
}

now I want to use foreach

library(foreach)
f3 <- function(allFiles){
    DT <- foreach(myFile=allFiles, .combine='rbind', .inorder=FALSE) %dopar% fread(myFile)
    return(DT);
}


Here are some benchmarks confirming that I can't get foreach working in parallel

system.time(DT <- f1(allFiles));
   user  system elapsed
  34.61    0.14   34.84
system.time(DT <- f2(allFiles));
   user  system elapsed
   1.03    0.40   24.30
system.time(DT <- f3(allFiles));
executing %dopar% sequentially: no parallel backend registered
   user  system elapsed
  35.05    0.22   35.38

Solution

Just to get this answered:

As the warning message tells you, there is no parallel backend registered for foreach, so %dopar% falls back to running sequentially. Read the doParallel vignette to learn how to register one.
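One way to see what foreach will do before running a long job is to ask it which backend is registered. The following is a minimal sketch, assuming the doParallel package is installed; the worker count of 2 is an arbitrary choice for illustration, and the variable names are hypothetical.

```r
library(doParallel)  # attaches foreach and parallel as dependencies

cl <- makeCluster(2)       # two workers, chosen arbitrarily for this sketch
registerDoParallel(cl)     # tell foreach to send %dopar% work to this cluster

registered <- getDoParRegistered()  # TRUE once a backend is registered
workers    <- getDoParWorkers()     # number of workers foreach will use
backend    <- getDoParName()        # name of the registered backend

stopCluster(cl)
registerDoSEQ()            # fall back to the sequential backend after cleanup
```

If `workers` is 1, the loop will not gain anything from %dopar% regardless of how it is written.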

Simple example from the vignette:

library(doParallel) 
cl <- makeCluster(3) 
registerDoParallel(cl) 
foreach(i=1:3) %dopar% sqrt(i) 
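Applied to the file-loading case from the question, the same registration step could look like the sketch below. This is not the asker's or the answerer's code: `read_all_parallel`, `n_workers`, and the small temporary CSV are hypothetical names introduced for illustration. It assumes data.table and doParallel are installed, and it collects the pieces in a list and binds them once with rbindlist instead of using .combine='rbind'.

```r
library(data.table)
library(doParallel)  # attaches foreach and parallel as dependencies

# Hypothetical helper: read each file on a worker, then row-bind the results.
read_all_parallel <- function(files, n_workers = 2) {
  cl <- makeCluster(n_workers)
  # Always release the workers and reset foreach to its sequential backend
  on.exit({ stopCluster(cl); registerDoSEQ() }, add = TRUE)
  registerDoParallel(cl)  # without this, %dopar% runs sequentially

  # .packages loads data.table on every worker so fread is available there
  parts <- foreach(f = files, .packages = "data.table") %dopar% fread(f)
  rbindlist(parts)        # bind all pieces once at the end
}

# Small demo file so the sketch runs quickly (10 rows, 5 columns)
tmp <- tempfile(fileext = ".csv")
write.table(matrix(rnorm(50), nrow = 10), tmp, quote = FALSE, sep = ",",
            row.names = FALSE, col.names = TRUE)

DT <- read_all_parallel(rep(tmp, 4))
dim(DT)
```

On Windows the workers are separate R processes, so each one has to load data.table itself; that is what the .packages argument is for. Starting the cluster has a fixed cost, so for files as small as the demo one the parallel version is not expected to be faster.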
