How do I replace the bootstrap step in the package randomForest in R?


Question





First some background info, which is probably more interesting on stats.stackexchange:

In my data analysis I try to compare the performance of different machine learning methods on time series data (regression, not classification). For example, I have trained a boosting model and compare it with a Random Forest model (R package randomForest).

I use time series data where the explanatory variables are lagged values of other variables and of the dependent variable itself.

For some reason the Random Forest severely underperforms. One of the problems I can think of is that the Random Forest resamples the training data for each tree. If it does this to time series data, the autoregressive nature of the series is completely removed.

To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so-called block-wise bootstrap step. This basically means I cut the training set into k parts, where k << N, and each part keeps its original order. If I sample these k parts with replacement, I still benefit from the 'randomness' of the Random Forest, but the time series nature is left largely intact.
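To make the idea concrete, here is a minimal sketch of drawing one block-wise bootstrap sample of row indices (illustrative only; N, k and the variable names are placeholders, not taken from the actual analysis):

#minimal sketch (hypothetical, not the original code): one block-wise
#bootstrap sample of row indices for N ordered observations and k blocks
N <- 600; k <- 40                                      #k << N
blocks <- split(1:N, cut(1:N, k, labels = FALSE))      #k contiguous blocks, order kept
boot.idx <- unlist(sample(blocks, k, replace = TRUE))  #resample whole blocks with replacement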

Now my problem is this:

To achieve this I would normally copy the existing function and edit the desired step/lines.

randomForest2 <- randomForest()

But the randomForest() function seems to be a wrapper for another wrapper for deeper underlying functions. So how can I edit the actual bootstrap step in the randomForest() function and still run the rest of the function regularly?
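For reference, one quick way to see what you are up against (assuming the CRAN randomForest package): the generic dispatches to randomForest.default, and printing that function suggests the per-tree sampling happens inside the compiled routine invoked via .C(), not in R code that could simply be copied and edited.

library(randomForest)
methods(randomForest)   #the generic dispatches to randomForest.default / randomForest.formula
randomForest.default    #printing the source: the per-tree sampling appears to be done
                        #inside the compiled .C() routine, so it cannot be edited from R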

Solution

So for me the solution wasn't editing the existing randomForest function. Instead I coded the block-wise bootstrap myself, using the split2 function given by Soren H. Welling to create the blocks. Once I had my data block-wise bootstrapped, I used a package (rpart) that fits just a single regression tree per sample and aggregated the trees' predictions myself (taking the means).

On my actual data the result is a slight but consistent improvement over the normal random forest performance in terms of RMSPE.

For the simulated data in the code below, the performance seems to be a coin toss.

Taking Soren's code as an example it looks a bit like this:

library(randomForest)
library(doParallel) #parallel package and mclapply is better for linux
library(rpart)

#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)

#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2

#past to present orientation    
y = sin((1:timepoints)*pi/30) * 1000 +
  sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")

#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise

#compute lagged copies of dy as explanatory variables
dX = sapply(1:40,function(i){
  getTheseLags = (1:timepoints) - i
  getTheseLags[getTheseLags<1] = NA #remove before start timePoints
  dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix of when lag exceed timeseries
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure
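#dX is a timepoints x 40 matrix: column i holds dy lagged by i steps,
#with the -100 sentinel filling positions before the series start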

#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test  = dy[-train]
dX.train  = dX[ train,]
dX.test   = dX[-train,]

#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)

#split a vector into contiguous blocks (like split(), but without mixing the order)
split2 = function(aVector,splits=31) {
  lVector = length(aVector)
  mod = lVector %% splits
  lBlocks = rep(floor(lVector/splits),splits)
  if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
  lapply(1:splits,function(i) {
    Stop  = sum(lBlocks[1:i])
    Start = Stop - lBlocks[i] + 1
    aVector[Start:Stop]
  })
}  
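#usage sketch (not in the original post): split2(1:10, splits=3)
#returns list(1:4, 5:7, 8:10) - contiguous blocks, original order preserved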


#create a list of block-wise bootstrapped samples
aBlock <- list()
numTrees <- 500
splits <- 40
for (ttt in 1:numTrees){

  aBlock[[ttt]] <- unlist(
    sample(
      split2(1:nrow(dX.train),splits=splits),
      splits,
      replace=T
    )
  )
}
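#each element of aBlock is one block-wise bootstrap sample of row indices:
#whole contiguous blocks are drawn with replacement, so the within-block
#ordering (and the local autocorrelation) stays intact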

#put data into a dataframe so rpart understands it
df1 <- data.frame(dy.train, dX.train)
#perform regression trees for Blocks
rfBlocks = foreach(aBlock = aBlock,
                   .packages=("rpart")) %dopar% {
                     dBlock = df1[aBlock,] 
                     rf = predict( rpart( dy.train ~., data = dBlock, method ="anova" ), newdata=data.frame(dX.test) ) 
                   } 
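#note: the rf assigned inside %dopar% is local to each worker; the global
#randomForest model `rf` fitted above is untouched and is reused below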

#predict test, make results table
#use rowMeans to aggregate the block-wise predictions
results = data.frame(predBlock   = rowMeans(do.call(cbind.data.frame, rfBlocks)),
                     true=dy.test,
                     predBootstrap = predict(rf,newdata=dX.test)
                     )
plot(results[,1:2],xlab="OOB-CV predicted change",
     ylab="trueChange",
     main="black = block-wise trees, blue = classic bootstrap rf")
points(results[,3:2],col="blue")

#prediction results
print(cor(results)^2)


stopCluster(cl)#close cluster
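
Since the comparison above is framed in terms of RMSPE while the code only prints squared correlations, here is a minimal sketch of how RMSPE could be computed from the results table (the rmspe helper is an addition for illustration, not part of the original answer):

#hypothetical helper, not in the original answer
rmspe = function(pred, true) sqrt(mean((pred - true)^2))
rmspe(results$predBlock,     results$true)  #block-wise bootstrapped trees
rmspe(results$predBootstrap, results$true)  #classic randomForest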
