R: Cross validation on a dataset with factors
Often, I want to run a cross validation on a dataset which contains some factor variables, and after running for a while, the cross validation routine fails with the error: factor x has new levels Y.
For example, using package boot:
library(boot)
d <- data.frame(x=c('A', 'A', 'B', 'B', 'C', 'C'), y=c(1, 2, 3, 4, 5, 6))
m <- glm(y ~ x, data=d)
m.cv <- cv.glm(d, m, K=2) # Sometimes succeeds
m.cv <- cv.glm(d, m, K=2)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor x has new levels B
Update: This is a toy example. The same problem occurs with larger datasets as well, where there are several occurrences of level C but none of them is present in the training partition.
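The failure can be reproduced deterministically, without cv.glm, by choosing the split by hand. The following is only a minimal sketch: the row indices are picked for illustration so that both B rows land in the test set.

```r
d <- data.frame(x = factor(c('A', 'A', 'B', 'B', 'C', 'C')),
                y = c(1, 2, 3, 4, 5, 6))

# Hand-picked split (an assumption for illustration): both 'B' rows are held out.
test.idx  <- c(1, 3, 4)
train.idx <- setdiff(seq_len(nrow(d)), test.idx)

# Fit on the training rows only; droplevels() removes the level the model never sees.
train <- droplevels(d[train.idx, ])
fit   <- glm(y ~ x, data = train)

# Levels present in the test rows but absent from the training rows.
unseen <- setdiff(unique(as.character(d$x[test.idx])), levels(train$x))
print(unseen)

# predict() fails for exactly those levels ("factor x has new levels B").
res <- try(predict(fit, newdata = d[test.idx, ]), silent = TRUE)
print(inherits(res, "try-error"))
```

Whenever every row of some level ends up on the held-out side, the refitted model has no coefficient for that level and prediction stops with this error.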
The function createDataPartition from the package caret does stratified sampling on the outcome variable and correctly warns:
Also, for ‘createDataPartition’, very small class sizes (<= 3) the classes may not show up in both the training and test data.
There are two solutions which spring to mind:
- First, create a subset of the data by selecting one random sample of each factor level, starting from the rarest class (by frequency) and then greedily satisfying the next rarest class, and so on. Then use createDataPartition on the rest of the dataset and merge the results to create a new train dataset which contains all levels.
- Use createDataPartition and do rejection sampling.
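The first option could be sketched roughly as follows. The helper name cover.all.levels is hypothetical, and a plain random split stands in for createDataPartition on the remaining rows so that the example needs only base R.

```r
# Hypothetical helper: pick row indices so that every level of every listed
# column is covered at least once, handling the rarest levels first.
cover.all.levels <- function(data, factor.cols) {
  # One row per (column, level) pair together with its frequency.
  pairs <- do.call(rbind, lapply(factor.cols, function(col) {
    tab <- table(as.character(data[[col]]))
    data.frame(col = col, level = names(tab), n = as.vector(tab),
               stringsAsFactors = FALSE)
  }))
  pairs <- pairs[order(pairs$n), ]  # rarest first

  chosen <- integer(0)
  for (i in seq_len(nrow(pairs))) {
    col <- pairs$col[i]
    lvl <- pairs$level[i]
    # Skip the level if an already chosen row covers it.
    if (lvl %in% as.character(data[[col]][chosen])) next
    candidates <- which(as.character(data[[col]]) == lvl)
    # Index into candidates to avoid sample()'s special case for length-1 input.
    chosen <- c(chosen, candidates[sample.int(length(candidates), 1)])
  }
  chosen
}

set.seed(1)
d <- data.frame(x = c('A', 'A', 'B', 'B', 'C', 'C'), y = c(1, 2, 3, 4, 5, 6))

seed.idx <- cover.all.levels(d, "x")
rest     <- setdiff(seq_len(nrow(d)), seed.idx)
# Plain random split of the remainder (stand-in for createDataPartition).
extra     <- rest[sample.int(length(rest), floor(length(rest) / 2))]
train.idx <- sort(c(seed.idx, extra))
test.idx  <- setdiff(seq_len(nrow(d)), train.idx)
```

This guarantees a single train/test split in which the training rows contain every level. It does not by itself produce K folds that all satisfy the constraint.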
So far, option 2 has worked for me because of the data sizes, but I cannot help but think that there must be a better solution than a hand-rolled one.
Ideally, I would want a solution which just works for creating partitions and fails early if there is no way to create such partitions.
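One condition can be checked up front. In K-fold cross validation every row is held out exactly once, so a level that occurs in only one row is necessarily missing from the training data when that row is held out. A minimal sketch of such a check, using the hypothetical helper name check.levels.for.cv, might look like this. It tests a necessary condition only, not a sufficient one.

```r
# Hypothetical helper: stop early when some level occurs in a single row,
# because no K-fold split can then keep that level in every training set.
check.levels.for.cv <- function(data, factor.cols) {
  for (col in factor.cols) {
    counts <- table(as.character(data[[col]]))
    singletons <- names(counts)[counts < 2]
    if (length(singletons) > 0) {
      stop("Column '", col, "' has levels with a single row: ",
           paste(singletons, collapse = ", "))
    }
  }
  invisible(TRUE)
}

d   <- data.frame(x = c('A', 'A', 'B', 'B', 'C', 'C'), y = 1:6)
bad <- data.frame(x = c('A', 'A', 'B', 'B', 'C'), y = 1:5)

ok.result  <- check.levels.for.cv(d, "x")                       # passes
bad.result <- try(check.levels.for.cv(bad, "x"), silent = TRUE)  # fails early
```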
Is there a fundamental theoretical reason why packages do not offer this? Do they offer it and I just haven't been able to spot them because of a blind spot? Is there a better way of doing this stratified sampling?
Please leave a comment if I should ask this question on stats.stackoverflow.com.
Update:
This is what my hand-rolled solution (2) looks like:
get.cv.idx <- function(train.data, folds, factor.cols = NA) {
  # Default: use every column that is stored as a factor.
  if (identical(factor.cols, NA)) {
    factor.cols <- names(train.data)[sapply(train.data, is.factor)]
  }
  n <- nrow(train.data)
  test.n <- floor(n / folds)
  cond.met <- FALSE
  n.tries <- 0
  # Rejection sampling: redraw the split until every factor level seen in
  # the test rows also occurs in the training rows.
  while (!cond.met) {
    n.tries <- n.tries + 1
    test.idx <- sample(n, test.n)
    train.idx <- setdiff(seq_len(n), test.idx)
    cond.met <- TRUE
    for (factor.col in factor.cols) {
      train.levels <- unique(train.data[train.idx, factor.col])
      test.levels <- unique(train.data[test.idx, factor.col])
      if (!all(test.levels %in% train.levels)) {
        cat('Factor column:', factor.col, 'violated constraint, retrying.\n')
        cond.met <- FALSE
      }
    }
  }
  cat('Done in', n.tries, 'tries.\n')
  list(train.idx = train.idx, test.idx = test.idx)
}
Everyone agrees that there sure is an optimal solution. But personally, I would just try the cv.glm call until it works, using while.
m.cv<- try(cv.glm(d, m, K=2)) #First try
class(m.cv) #Sometimes error, sometimes list
while ( inherits(m.cv, "try-error") ) {
m.cv<- try(cv.glm(d, m, K=2))
}
class(m.cv) #always list
I've tried it with 100,000 rows in the data.frame and it only takes a few seconds.
library(boot)
n <-100000
d <- data.frame(x=c(rep('A',n), rep('B', n), 'C', 'C'), y=1:(n*2+2))
m <- glm(y ~ x, data=d)
m.cv<- try(cv.glm(d, m, K=2))
class(m.cv) #Sometimes error, sometimes list
while ( inherits(m.cv, "try-error") ) {
m.cv<- try(cv.glm(d, m, K=2))
}
class(m.cv) #always list
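One caveat with an open-ended while loop: if some level occurs in only one row, the call can never succeed and the loop never ends. A bounded variant is sketched below. The helper name retry and the max.tries parameter are assumptions for illustration, and a deliberately flaky function stands in for the cv.glm call so that the example runs on its own.

```r
# Hypothetical helper: call f() until it succeeds, giving up after max.tries.
retry <- function(f, max.tries = 100) {
  for (i in seq_len(max.tries)) {
    res <- try(f(), silent = TRUE)
    if (!inherits(res, "try-error")) {
      return(list(value = res, tries = i))
    }
  }
  stop("No successful call in ", max.tries, " tries")
}

# Stand-in for the cv.glm call: fails on the first two calls, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("factor x has new levels B")
  "ok"
}

out <- retry(flaky, max.tries = 10)
print(out$tries)
```

With the data above, the call would be retry(function() cv.glm(d, m, K = 2))$value. Note that try() swallows every kind of error, not only the new-levels one, so a genuine bug in the model call would also be retried until the limit is reached.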