R中的并行计算:如何使用内核 [英] Parallel Computing in R : how to use the cores

查看:122
本文介绍了R中的并行计算:如何使用内核的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在尝试在R中进行并行计算. 我正在尝试训练后勤岭模型,并且我的计算机上目前有4个核.我想将我的数据集平均分为4个部分,并使用每个核来训练模型(基于训练数据),并将每个核的结果保存到单个向量中.问题是我不知道如何执行此操作,现在我尝试与foreach包并行运行,但是问题是每个内核都看到相同的训练数据.这是带有foreach包的代码(不会拆分数据):

I am currently trying parallel computing in R. I am trying to train a logistic ridge model , and I currently have 4 Cores on my computer. I would like to split my data set equally into 4 pieces, and use each core to train model (on the training data) and save the result of each core into a single vector . the problem is that i have no clue how to do it, right now I tried to parallel with the foreach package, but the problem is the each core sees the same training data. here is the code with the foreach package (which doesn't split the data) :

library(ridge)
library(parallel)
library(foreach)

num_of_cores <- detectCores()
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
data_per_core <- floor(nrow(mydata)/num_of_cores)
result <- data.frame()

r <- foreach(icount(4), .combine = cbind) %dopar% {
      result <- logisticRidge(admit~ gre + gpa + rank,data = mydata)
      coefficients(result)
}

有什么主意如何将数据同时分成x个块并并行训练模型?

any idea how to simultaneously split the data into x chunks and train the models in parallel ?

推荐答案

这样的事情怎么样?它使用snowfall而不是foreach -library,但应给出相同的结果.

How about something like this? It uses snowfall instead of the foreach-library, but should give the same results.

library(snowfall)
library(ridge)

# for reproducability
set.seed(123)
num_of_cores <- parallel::detectCores()
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
data_per_core <- floor(nrow(mydata)/num_of_cores)

# we take random rows to each cluster, by sampleid
mydata$sampleid <- sample(1:num_of_cores, nrow(mydata), replace = T)

# create a small function that calculates the coefficients
regfun <- function(dat) {
  library(ridge) # this has to be in the function, otherwise snowfall doesnt know the logistic ridge function
  result <- logisticRidge(admit~ gre + gpa + rank, data = dat)
  coefs <- as.numeric(coefficients(result))
  return(coefs)
}

# prepare the data
datlist <- lapply(1:num_of_cores, function(i){
  dat <- mydata[mydata$sampleid == i, ]
})

# initiate the clusters
sfInit(parallel = T, cpus = num_of_cores)

# export the function and the data to the cluster
sfExport("regfun")

# calculate, (sfClusterApply is very similar to sapply)
res <- sfClusterApply(datlist, function(datlist.element) {
  regfun(dat = datlist.element)
})

#stop the cluster
sfStop()

# convert the list to a data.frame. data.table::rbindlist(list(res)) does the same job
res <- data.frame(t(matrix(unlist(res), ncol = num_of_cores)))
names(res) <- c("intercept", "gre", "gpa", "rank")
res
# res
# intercept          gre
# 1 -3.002592 1.558363e-03
# 2 -4.142939 1.060692e-03
# 3 -2.967130 2.315487e-03
# 4 -1.176943 4.786894e-05
# gpa         rank
# 1  0.7048146997 -0.382462408
# 2  0.9978841880 -0.314589628
# 3  0.6797382218 -0.464219036
# 4 -0.0004576679 -0.007618317

这篇关于R中的并行计算:如何使用内核的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆