分层抽样似乎不会改变 randomForest 结果 [英] Stratified sampling doesn't seem to change randomForest results

查看:68
本文介绍了分层抽样似乎不会改变 randomForest 结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 R 中使用 randomForest 包来构建多个物种分布模型.我的响应变量是二元的(0 - 缺席或 1 存在),并且非常不平衡 - 对于某些物种,缺席:存在的比率是 37:1.这种不平衡(或零通胀)导致了有问题的袋外误差估计——缺席与存在的比率越大,我的袋外 (OOB) 误差估计越低.

I am using the randomForest package in R to build several species distribution models. My response variable is binary (0 - absence or 1-presence), and pretty unbalanced - for some species the ratio of absences:presences is 37:1. This imbalance (or zero-inflation) leads to questionable out-of-bag error estimates - the larger the ratio of absences to presence, the lower my out-of-bag (OOB) error estimate.

为了弥补这种不平衡,我想实施分层抽样,以便随机森林中的每棵树都包含相等(或至少不平衡)数量的来自存在和不存在类别的结果.我很惊讶分层和未分层模型的 OOB 误差估计似乎没有任何差异.请参阅下面的代码:

To compensate for this imbalance, I wanted to implement stratified sampling such that each tree in the random forest included an equal (or at least less imbalanced) number of results from both the presence and absences category. I was surprised that there doesn't seem to be any difference in the stratified and unstratified model OOB error estimates. See my code below:

> set.seed(25)
> HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
> HHrf
Call:
  randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 19.1%
Confusion matrix:
    0  1 class.error
0 422 18  0.04090909
1  84 10  0.89361702

有分层

> HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, strata = bll_HH$HH_Pres, sampsize = ceiling(.632*nrow(bll_HH)))
> HHrf

Call:
 randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 19.1%
Confusion matrix:
    0  1 class.error
0 422 18  0.04090909
1  84 10  0.89361702

在两种情况下我得到相同的结果是否有原因?对于strata 参数,我指定了我的响应变量HH_Pres.对于 sampsize 参数,我指定它应该只是整个数据集的 63.2%.

Is there a reason that I am getting the same results in both cases? For the strata argument, I specify my response variable, HH_Pres. For the sampsize argument, I specify that it should just be 63.2% of the entire dataset.

有人知道我做错了什么吗?或者这是在意料之中?

Anyone know what I am doing wrong? Or is this to be expected?

谢谢,

丽莎

重现这个问题:

示例数据:https://docs.google.com/file/d/0B-JMocik79JzY3B4U3NoU3kyNW8/edit

代码:

bll = read.csv("bll_Nov2013_NMV.csv", header=TRUE)
HH_Pres <- bll$HammerHeadALL_Presence

Slope <-bll$Slope
Dist2Shr <-bll$Dist2Shr
Bathy <-bll$Bathy2
Chla <-bll$GSM_Chl_Daily_MF
SST <-bll$SST_PF_daily
Region <- bll$Region
MoonPhase <-bll$MoonPhase
DaylightHours <- bll$DaylightHours
bll_HH <- data.frame(HH_Pres, Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region)
set.seed(25)

HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
HHrf
set.seed(25)
HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, strata = bll_HH$HH_Pres, sampsize = c(100, 50), ntree = 500, replace = FALSE, importance = TRUE)
HHrf

推荐答案

据我所知,sampsize 参数应该是一个向量,其长度与数据中的类数相同放.如果您在 strata 参数中指定了一个因子变量,则 sampsize 应该被赋予一个与 strata 中的因子数量相同长度的向量代码> 参数.我不确定它是否像您在问题中描述的那样执行,但自从我使用 randomForest 函数以来已经有一段时间了.

As far as I know, the sampsize argument should be a vector that is the same length as the number of classes in your data set. If you specify a factor variable in the strata argument, then sampsize should be given a vector that is the same length as the number of factors in the strata argument. I am not sure it performs as you describe in your question, but it has been a while since I have used the randomForest function.

从帮助文件中,它说:

地层

用于分层抽样的(因子)变量.

A (factor) variable that is used for stratified sampling.

样本大小:

要绘制的样本大小.对于分类,如果 sampsize 是一个向量层数的长度,然后抽样按以下方式分层层,sampsize 的元素表示要绘制的数字来自地层.

Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

例如,由于您的分类有 2 个不同的类,您需要给 sampsize 一个长度为 2 的向量,指定在训练期间您希望从每个类中采样的观察值数量.

For example, since your classification has 2 distinct classes, you need to give sampsize a vector of length 2 that specifies how many observations you want to sample from each class during training time.

例如sampsize=c(100,50)

此外,您可以指定组的名称以使其更加清晰.

Furthermore, you can specify the names of the groups to be extra clear.

例如sampsize=c('0'=100, '1'=50)

帮助文件中使用 sampsize 参数的示例,以阐明:

An example from the help files that uses the sampsize argument, to clarify:

## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
data(iris)
(iris.rf2 <- randomForest(iris[1:4], iris$Species, sampsize=c(20, 30, 20)))

randomForest 中添加了一些关于 strata 参数的注释.

Added some notes about the strata argument in randomForest.

确保 strata 参数被赋予一个因子变量!

Make sure the strata argument is given a factor variable!

例如尝试 strata = factor(HH_Pres), sampsize = c(...) 其中 c(...) 是一个与 length 长度相同的向量(水平(因子(bll_HH$HH_Pres)))

e.g. try strata = factor(HH_Pres), sampsize = c(...) where c(...) is a vector that is the same length as length(levels(factor(bll_HH$HH_Pres)))

好的,我试着用你的数据运行代码,它对我有用.

OK, I tried running the code with your data, and it works for me.

# Fix up the data set to have HH_Pres and Region as factors
bll_HH$Region <- factor(bll_HH$Region)
bll_HH$HH_Pres <- factor(bll_HH$HH_Pres)

# Original RF code
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                      Slope + MoonPhase + Chla + Region,
                    data=bll_HH, ntree = 500, replace = FALSE, 
                    importance = TRUE, na.action = na.omit)
HHrf

# Output
#         OOB estimate of  error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 425 15  0.03409091
# 1  86  8  0.91489362

# Take 63.2% from each class
mySampSize <- ceiling(table(bll_HH$HH_Pres) * 0.632)

set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE, 
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
#         OOB estimate of  error rate: 18.91%
# Confusion matrix:
#     0  1 class.error
# 0 424 16  0.03636364
# 1  85  9  0.90425532

请注意,在这种情况下,OOB 错误估计是相同的,即使我们只使用了引导程序样本中每个类的 63.2% 的数据.这可能是由于使用的样本大小与训练数据中的类分布成正比,并且数据集的大小相对较小.让我们尝试更改 mySampSize 以确保它真的有效.

Note that the OOB error estimate is the same in this case, even if we only use 63.2% of the data from each of the classes in our bootstrap samples. This is probably due to using sample sizes that are proportional to the class distribution in your training data, and the relatively small size of your data set. Let's try changing mySampSize to make sure it REALLY worked.

# Change mySampSize. Sample 100 from class 0 and 50 from class 1
mySampSize[1] <- 100
mySampSize[2] <- 50

set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
                       Slope + MoonPhase + Chla + Region,
                     data=bll_HH, ntree = 500, replace = FALSE, 
                     importance = TRUE, na.action = na.omit,
                     sampsize=mySampSize)
HHrf
# Output
#         OOB estimate of  error rate: 21.16%
# Confusion matrix:
#     0  1 class.error
# 0 382 58   0.1318182
# 1  55 39   0.5851064

这篇关于分层抽样似乎不会改变 randomForest 结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆