R: Cross validation on a dataset with factors


Question



Often, I want to run a cross validation on a dataset which contains some factor variables and after running for a while, the cross validation routine fails with the error: factor x has new levels Y.

For example, using package boot:

library(boot)
d <- data.frame(x=c('A', 'A', 'B', 'B', 'C', 'C'), y=c(1, 2, 3, 4, 5, 6))
m <- glm(y ~ x, data=d)
m.cv <- cv.glm(d, m, K=2) # Sometimes succeeds
m.cv <- cv.glm(d, m, K=2)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
#   factor x has new levels B


Update: This is a toy example. The same problem occurs with larger datasets as well, where there are several occurrences of level C but none of them is present in the training partition.


The createDataPartition function from the caret package does stratified sampling on the outcome variable and correctly warns:

Also, for ‘createDataPartition’, very small class sizes (<= 3) the classes may not show up in both the training and test data.
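For illustration, here is a minimal sketch of that stratified split applied to the toy data above (this assumes the caret package is installed; the partition is taken on the factor column x rather than on the outcome y):

library(caret)

d$x <- factor(d$x)                                           # make sure x really is a factor
part.idx <- createDataPartition(d$x, p = 0.5, list = FALSE)  # stratified on the levels of x
table(d$x[part.idx])                                         # every level should appear, subject to the small-class caveat quoted above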

There are two solutions which spring to mind:

  1. First, create a subset of the data by selecting one random sample of each factor level, starting from the rarest class (by frequency) and then greedily satisfying the next rarest class, and so on. Then use createDataPartition on the rest of the dataset and merge the results to create a new train dataset which contains all levels (see the sketch after this list).
  2. Use createDataPartition and do rejection sampling.
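
A rough sketch of option 1, assuming caret is loaded and d$x is a factor as above (seed.one.per.level is a hypothetical helper name, not from any package; on the tiny toy data caret may warn about small class sizes):

# Reserve one random row per level of the factor, rarest level first, so the
# training set is guaranteed to contain every level; then partition the rest.
seed.one.per.level <- function(d, factor.col) {
    lvl.order <- names(sort(table(d[[factor.col]])))   # rarest level first
    vapply(lvl.order, function(lvl) {
        rows <- which(d[[factor.col]] == lvl)
        rows[sample(length(rows), 1)]                  # avoids sample()'s single-integer pitfall
    }, integer(1))
}

seed.idx  <- seed.one.per.level(d, 'x')
rest.idx  <- setdiff(seq_len(nrow(d)), seed.idx)
part.idx  <- rest.idx[createDataPartition(d$x[rest.idx], p = 0.5, list = FALSE)]
train.idx <- c(seed.idx, part.idx)                     # merged training indices, all levels present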

So far, option 2 has worked for me because of the data sizes, but I cannot help but think that there must be a better solution than a hand-rolled one.

Ideally, I would want a solution which just works for creating partitions and fails early if there is no way to create such partitions.
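
One way to get that fail-early behaviour is sketched below (check.partitionable is a hypothetical helper, not from any package): a level that occurs only once can never be present in both the training and the test part of a fold, so the split can be refused up front, while levels occurring at least twice can always be spread over two folds, although purely random assignment may still need retries.

# Hypothetical fail-early check: refuse to partition when some factor level
# occurs fewer than two times, because such a level cannot be present in both
# the training and the test partition of any fold.
check.partitionable <- function(d, factor.cols) {
    for (col in factor.cols) {
        counts <- table(d[[col]])
        bad    <- names(counts)[counts < 2]
        if (length(bad) > 0) {
            stop('Factor ', col, ' has single-occurrence level(s): ',
                 paste(bad, collapse = ', '))
        }
    }
    invisible(TRUE)
}

check.partitionable(d, 'x')   # passes for the toy data; stops immediately otherwise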

Is there a fundamental theoretical reason why packages do not offer this? Do they offer it and I just haven't been able to spot it because of a blind spot? Is there a better way of doing this stratified sampling?

Please leave a comment if I should ask this question on stats.stackoverflow.com.


Update:

This is what my hand-rolled solution (2) looks like:

library(plyr)   # provides llply()/laply(), used below to auto-detect factor columns

get.cv.idx <- function(train.data, folds, factor.cols = NA) {

    if (all(is.na(factor.cols))) {   # default: detect factor columns automatically
        all.cols        <- colnames(train.data)
        factor.cols     <- all.cols[laply(llply(train.data[1, ], class), function (x) 'factor' %in% x)]
    }

    n                   <- nrow(train.data)
    test.n              <- floor(1 / folds * n)

    cond.met            <- FALSE
    n.tries             <- 0

    while (!cond.met) {
        n.tries         <- n.tries + 1
        test.idx        <- sample(nrow(train.data), test.n)
        train.idx       <- setdiff(1:nrow(train.data), test.idx)

        cond.met        <- TRUE

        for(factor.col in factor.cols) {
            train.levels <- train.data[ train.idx, factor.col ]
            test.levels  <- train.data[ test.idx , factor.col ]
            # Every level present in the test split must also be present in the
            # training split, otherwise predict() complains about new levels.
            if (length(setdiff(unique(test.levels), unique(train.levels))) > 0) {
                cat('Factor level: ', factor.col, ' violated constraint, retrying.\n')
                cond.met <- FALSE
            }
        }
    }

    cat('Done in ', n.tries, ' attempt(s).\n')

    list( train.idx = train.idx
        , test.idx  = test.idx
        )
}
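
Hypothetical usage on the toy data frame d from above, passing factor.cols explicitly so the plyr-based auto-detection is not needed:

d$x <- factor(d$x)                                    # x must be an actual factor
idx <- get.cv.idx(d, folds = 2, factor.cols = 'x')    # resamples until the constraint holds
str(idx)                                              # list with $train.idx and $test.idx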

Solution

Everyone agrees that there surely is an optimal solution, but personally I would just retry the cv.glm call in a while loop until it succeeds.

m.cv <- try(cv.glm(d, m, K = 2))   # First try
class(m.cv)                        # Sometimes "try-error", sometimes "list"
while (inherits(m.cv, "try-error")) {
    m.cv <- try(cv.glm(d, m, K = 2))
}
class(m.cv)                        # Always "list"

I've tried it with 100,000 rows in the data.frame and it only takes a few seconds.

library(boot)
n <- 100000
d <- data.frame(x = c(rep('A', n), rep('B', n), 'C', 'C'), y = 1:(n * 2 + 2))
m <- glm(y ~ x, data = d)

m.cv <- try(cv.glm(d, m, K = 2))
class(m.cv)                        # Sometimes "try-error", sometimes "list"
while (inherits(m.cv, "try-error")) {
    m.cv <- try(cv.glm(d, m, K = 2))
}
class(m.cv)                        # Always "list"
