Random forest bootstrap training and forest generation
Question
I have huge training data for a random forest (dim: 47600811 * 9). I want to take multiple (say 1000) bootstrapped samples of dimension 10000 * 9 (drawing 9000 negative-class and 1000 positive-class data points in each run), iteratively grow trees for all of them, and then combine all those trees into one forest. A rough idea of the required code is given below. Can anybody guide me on how to generate random samples with replacement from my actual trainData and efficiently grow trees for them iteratively? It would be a great help. Thanks.
library(doSNOW)
library(randomForest)
cl <- makeCluster(8)
registerDoSNOW(cl)
for (i in 1:1000) {
  B <- 1000
  U <- 9000
  dataB <- trainData[sample(which(trainData$class == "B"), B, replace=TRUE),]
  dataU <- trainData[sample(which(trainData$class == "U"), U, replace=TRUE),]
  subset <- rbind(dataB, dataU)
I am not sure if this is the optimal way of producing a subset again and again (1000 times) from the actual trainData.
  rf <- foreach(ntree=rep(125, 8), .packages='randomForest') %dopar% {
    randomForest(subset[,-1], subset$class, ntree=ntree)
  }
}
crf <- do.call('combine', rf)
print(crf)
stopCluster(cl)
Answer
Although your example parallelizes the inner rather than the outer loop, it may work reasonably well as long as the inner foreach loop takes more than a few seconds to execute, which it almost certainly does. However, your program does have a bug: it throws away the first 999 foreach results and only processes the last one. To fix this, you can preallocate a list of length 1000 * 8 and assign the results from foreach into it on each iteration of the outer for loop. For example:
library(doSNOW)
library(randomForest)
trainData <- data.frame(a=rnorm(20), b=rnorm(20),
                        class=c(rep("U", 10), rep("B", 10)))
n <- 1000 # outer loop count
chunksize <- 125 # value of ntree used in inner loop
nw <- 8 # number of cluster workers
cl <- makeCluster(nw)
registerDoSNOW(cl)
rf <- vector('list', n * nw)
for (i in 1:n) {
  B <- 1000
  U <- 9000
  dataB <- trainData[sample(which(trainData$class == "B"), B, replace=TRUE),]
  dataU <- trainData[sample(which(trainData$class == "U"), U, replace=TRUE),]
  subset <- rbind(dataB, dataU)
  ix <- seq((i-1) * nw + 1, i * nw)
  rf[ix] <- foreach(ntree=rep(chunksize, nw),
                    .packages='randomForest') %dopar% {
    randomForest(subset[,-1], subset$class, ntree=ntree)
  }
}
cat(sprintf("# models: %d; expected # models: %d\n", length(rf), n * nw))
cat(sprintf("expected total # trees: %d\n", n * nw * chunksize))
crf <- do.call('combine', rf)
print(crf)
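As a quick sanity check (not part of the original answer), randomForest's combine() sums the trees of the forests it merges, so the combined forest's ntree should equal the expected total printed above. A minimal toy illustration, using made-up data:

```r
library(randomForest)

# Toy data for illustration only
set.seed(1)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60),
                y  = factor(rep(c("B", "U"), 30)))

# Two small forests built on the same predictors
f1 <- randomForest(d[, c("x1", "x2")], d$y, ntree = 10)
f2 <- randomForest(d[, c("x1", "x2")], d$y, ntree = 15)

# combine() merges them; the result should report 25 trees
fc <- combine(f1, f2)
print(fc$ntree)
```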
This should fix the problem that you mentioned in the comment you directed to me.
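An alternative worth considering (a sketch, not from the original answer, and assuming trainData is visible to the workers — foreach exports variables it finds in the calling environment) is to drop the outer for loop entirely and let each foreach task do its own stratified resampling and grow one forest, which removes the index bookkeeping:

```r
library(doSNOW)
library(randomForest)

cl <- makeCluster(8)
registerDoSNOW(cl)

# Each of the 1000 tasks draws its own bootstrap sample (1000 "B" rows,
# 9000 "U" rows, with replacement) and grows one 125-tree forest.
rf <- foreach(i = 1:1000, .packages = 'randomForest') %dopar% {
  dataB <- trainData[sample(which(trainData$class == "B"), 1000, replace = TRUE), ]
  dataU <- trainData[sample(which(trainData$class == "U"), 9000, replace = TRUE), ]
  subset <- rbind(dataB, dataU)
  randomForest(subset[, -1], subset$class, ntree = 125)
}

crf <- do.call('combine', rf)
stopCluster(cl)
```

Note that this grows 125 trees per bootstrap sample (125,000 total) rather than 1000 per sample as in the code above; adjust ntree if you want the same total. The trade-off is that each worker must hold a copy of trainData, which may matter at 47 million rows.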