如何按组/子集进行一次留出交叉验证? [英] How to do a Leave One Out cross validation by group / subset?

查看:58
本文介绍了如何按组/子集进行一次留出交叉验证?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

此问题是上一个问题的第二部分().

This question is the second part of a previous question (Linear Regression prediction in R using Leave One out Approach).

我正在尝试为每个国家/地区建立模型,并使用留一法"方法生成线性回归预测.换句话说,在下面的代码中,当构建模型1和模型2时,数据"将被创建.使用的不应该是整个数据集.相反,它应该是数据集(国家/地区)的子集.每个国家/地区的数据都应使用该国家/地区特定数据构建的模型进行评估.

I'm trying to build models for each country and generate linear regression predictions using the leave one out approach. In other words, in the code below when building model1 and model2 the "data" used should not be the entire data set. Instead it should be a subset of the dataset (country). Each country data should be evaluated using a model built with data specific to that country.

以下代码返回错误.如何修改/修复下面的代码来做到这一点?还是有更好的方法呢?

The code below returns an error. How can I modify/fix the code below to do that? Or is there a better way of doing that?

library(modelr)
install.packages("gapminder")
library(gapminder)                           
data(gapminder) 

#CASE 1
model1 <- lm(lifeExp ~ pop, data = gapminder, subset = country)
model2 <- lm(lifeExp ~ pop + gdpPercap, data = gapminder, subset = country)

models <- list(fit_model1 = model1,fit_model2 = model2)

gapminder %>% nest_by(continent, country) %>%
  bind_cols(
    map(1:nrow(gapminder), function(i) {
      map_dfc(models, function(model) {
        training <- data[-i, ] 
        fit <- lm(model, data = training)
        
        validation <- data[i, ]
        predict(fit, newdata = validation)
        
      })
    }) %>%
      bind_rows()
  )
 

推荐答案

最简洁明了的解决方案是嵌套的 for 循环方法,其中外部循环是两个模型公式,内部循环是两个模型公式.循环是我们要遗漏的团结.这也可以通过 outer 完成,我也将在后面显示.

The most succinct and straightforward solution would be a nested for loop approach, where the outer loop is the two model formulae and the inner loop is the unity we want to leave out. This can also be done with outer, which I also show afterwards.

为清楚起见,我首先说明如何在每次迭代中保留一个观察值(即一行)(第一部分).稍后我将说明如何忽略一个群集(例如国家/地区)(第二部分).我还使用了内置的 iris 数据集,该数据集较小,因此更易于处理.它包含一个种类" 列,该列旨在与您数据中的国家/地区" 相对应.

For sake of clarity I first show how to leave out one observation (i.e. one row) in each iteration (Part I). I show later how to leave out one cluster (e.g. country) (Part II). I also use the built-in iris data set, which is smaller and thus easier to handle. It contains a "Species" column that is meant to correspond to the "countries" in your data.

首先,我们将两个公式放入列表中,并命名它们,就像我们希望它们稍后出现在结果列中一样.

First, we put the two formulae into a list and name them as we would like them to appear in the resulting columns later.

FOAE <- list(fit1=Petal.Length ~ Sepal.Length, 
             fit2=Petal.Length ~ Sepal.Length + Petal.Width)

对于循环,我们要初始化一个矩阵 im ,该矩阵的行对应于我们要省略的行数,而列对应于模型公式的数.

For the loop, we want to initialize a matrix im whose rows correspond to the number of rows we want to leave out, and columns to the number of model formulae.

im <- matrix(NA, nrow=nrow(iris), ncol=length(FOAE), 
             dimnames=list(NULL, names(FOAE)))

这看起来像这样:

head(im, n=3)
#      fit1 fit2
# [1,]   NA   NA
# [2,]   NA   NA
# [3,]   NA   NA

现在,我们如上所述循环遍历公式和行.

Now we loop over formulas and rows as described above.

for (i in seq(FOAE)) {
  for(j in seq(nrow(iris))) {
    train <- iris[-j,]  
    test <- iris[j,]    
    fit <- lm(FOAE[[i]], data=train)
    im[j, i] <- predict(fit, newdata=test)
  }
}

现在已经填充了

im ,我们可以将其 cbind 到原始的 iris 数据集,以得到结果 res1 .

im has now been filled, and we may cbind it to the original iris data set to get our result res1.

res1 <- cbind(iris, im)
head(res1)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# 1          5.1         3.5          1.4         0.2  setosa 2.388501 1.611976
# 2          4.9         3.0          1.4         0.2  setosa 2.014324 1.501389
# 3          4.7         3.2          1.3         0.2  setosa 1.639805 1.392955
# 4          4.6         3.1          1.5         0.2  setosa 1.446175 1.333199
# 5          5.0         3.6          1.4         0.2  setosa 2.201646 1.556620
# 6          5.4         3.9          1.7         0.4  setosa 2.944788 2.127184

要另外采用 外部方法,我们将代码放在 for 循环内,放入一个公式我们 Vectorize ,以便它可以处理矩阵列(即向量).

To alternatively follow the outer approach, we put the code inside the for loop into a formula which we Vectorize so that it can handle matrix columns (i.e. vectors).

FUN1 <- Vectorize(function(x, y) {
  train <- iris[-x,]
  test <- iris[x,]
  fit <- lm(y, data=train)
  predict(fit, newdata=test)
})

现在,我们将 FOAE 和行 1:nrow(iris)留在后面,并与 FUN1 一起放入外部().这已经给我们提供了结果,我们可以用与上述相同的方法将 cbind 绑定到 iris ,以获得结果 res2 .

Now we put FOAE and the rows 1:nrow(iris) to leave out subsequently, together with FUN1 into outer(). This already gives us the result that we can cbind to iris in the same way as above to get our result res2.

o1 <- outer(FOAE, 1:nrow(iris), FUN1)
res2 <- cbind(iris, o1)

head(res2)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# 1          5.1         3.5          1.4         0.2  setosa 2.388501 1.611976
# 2          4.9         3.0          1.4         0.2  setosa 2.014324 1.501389
# 3          4.7         3.2          1.3         0.2  setosa 1.639805 1.392955
# 4          4.6         3.1          1.5         0.2  setosa 1.446175 1.333199
# 5          5.0         3.6          1.4         0.2  setosa 2.201646 1.556620
# 6          5.4         3.9          1.7         0.4  setosa 2.944788 2.127184

## test if results are different is negative 
stopifnot(all.equal(res1, res2))

第二部分

当我们忽略一个集群(即物种或国家)时,我们可能会采用类似的方法.我在这里显示 outer 方法.我们要更改的事情是,我们现在要忽略属于特定聚类的观察,这里是"Species" (在您的情况下是"countries" ),我们将哪些 unique 值放入向量 Species.u 中.由于这些值采用字符" "factor" 格式,因此我们使用 data [!data $ cluster%in%x,] 而不是 data [-x,] .因为 predict 将在集群中产生多个值,但是我们希望在各个集群中具有相同的值,所以我们可能希望使用统计信息,例如每个群集的 mean 预测.我们根据集群使用 rownames .

Part II

We may follow a similar approach when leaving out a cluster (i.e. species or countries). I show here the outer method. The thing we want to change is that we now want to leave out observations belonging to a specific cluster, here "Species" (in your case "countries"), which unique values we put into a vector Species.u . Since the values are in "character" or "factor" format we subset the data using data[!data$cluster %in% x, ] instead of data[-x, ]. Because predict would yield multiple values in the clusters, but we want the same value in the respective clusters, we might want to use a statistic, e.g. the mean prediction of each cluster. We use rownames according to the cluster.

FUN2 <- Vectorize(function(x, y) {
  train <- iris[!iris$Species %in% x,]
  test <- iris[iris$Species %in% x,]
  fit <- lm(y, data=train)
  mean(predict(fit, newdata=test))
})
Species.u <- unique(iris$Species)

o2 <- `rownames<-`(outer(Species.u, FOAE, FUN2), Species.u)

现在,这给了我们一个小于数据集的矩阵.多亏了 rownames ,我们可以对它们所属的簇的预测进行 match 匹配.

This now gives us a matrix which is smaller than our data set. Thanks to the rownames we may match the predictions tho the clusters to which they belong.

o2
#                fit1     fit2
# setosa     3.609943 2.662609
# versicolor 3.785760 3.909919
# virginica  4.911009 5.976922

res3 <- cbind(iris, o2[match(iris$Species, rownames(o2)), ])

head(res3)
#          Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# setosa            5.1         3.5          1.4         0.2  setosa 3.609943 2.662609
# setosa.1          4.9         3.0          1.4         0.2  setosa 3.609943 2.662609
# setosa.2          4.7         3.2          1.3         0.2  setosa 3.609943 2.662609
# setosa.3          4.6         3.1          1.5         0.2  setosa 3.609943 2.662609
# setosa.4          5.0         3.6          1.4         0.2  setosa 3.609943 2.662609
# setosa.5          5.4         3.9          1.7         0.4  setosa 3.609943 2.662609

tail(res3)
#              Sepal.Length Sepal.Width Petal.Length Petal.Width   Species     fit1     fit2
# virginica.44          6.7         3.3          5.7         2.5 virginica 4.911009 5.976922
# virginica.45          6.7         3.0          5.2         2.3 virginica 4.911009 5.976922
# virginica.46          6.3         2.5          5.0         1.9 virginica 4.911009 5.976922
# virginica.47          6.5         3.0          5.2         2.0 virginica 4.911009 5.976922
# virginica.48          6.2         3.4          5.4         2.3 virginica 4.911009 5.976922
# virginica.49          5.9         3.0          5.1         1.8 virginica 4.911009 5.976922

编辑

在此版本的 FUN2 ( FUN3 )中,每个集群模型的输出都进行了 rbind (当然,在两列中,因为有两个模型).

Edit

In this version of FUN2, FUN3, the output of the models of each cluster are rbinded (in two columns of course, because of two models).

FUN3 <- Vectorize(function(x, y) {
  train <- iris[!iris$Species %in% x,]
  test <- iris[iris$Species %in% x,]
  fit <- lm(y, data=train)
  (predict(fit, newdata=test))
}, SIMPLIFY=F)
Species.u <- unique(iris$Species)

o3 <- `rownames<-`(outer(Species.u, FOAE, FUN3), Species.u)

res32 <- cbind(iris, apply(o3, 2, unlist))
head(res32)
#          Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# setosa.1          5.1         3.5          1.4         0.2  setosa 3.706940 2.678255
# setosa.2          4.9         3.0          1.4         0.2  setosa 3.500562 2.547587
# setosa.3          4.7         3.2          1.3         0.2  setosa 3.294183 2.416919
# setosa.4          4.6         3.1          1.5         0.2  setosa 3.190994 2.351586
# setosa.5          5.0         3.6          1.4         0.2  setosa 3.603751 2.612921
# setosa.6          5.4         3.9          1.7         0.4  setosa 4.016508 3.073249

编辑2

正如我在您的评论中了解到的,您需要1.集群中数据的子集.这将是下面 FUN4 中的 ss .然后,通过在子集 ss 的行上保留一行 z 来对 ss 进行子集化.

Edit 2

As I learned in your comment you want 1. a subset of your data along clusters. This would be ss in FUN4 below. Then the ss is also subsetted by leaving out one row z over the rows of subset ss.

FUN4 <- Vectorize(function(x, y) {
  ## subsets first by cluster then by row
  ss <- iris[iris$Species %in% x,]  ## cluster subset
  sapply(1:nrow(ss), function(z) {  ## subset rows using `sapply`
    train <- ss[-z,]  ## train data w/o row z
    test <- ss[z,]    ## test data for `predict`, just row z
    fit <- lm(y, data=train)
    predict(fit, newdata=test)
  })
}, SIMPLIFY=F)

## the two models
FOAE <- list(fit1=Petal.Length ~ Sepal.Length, 
             fit2=Petal.Length ~ Sepal.Length + Petal.Width)

## unique cluster names
Species.u <- unique(iris$Species)

## with the `outer` we iterate over all the permutations of clusters and models `FOAE`.
o4 <- `rownames<-`(outer(Species.u, FOAE, FUN4), Species.u)

## `unlist`ed result is directly `cbind`able to original data
res4 <- cbind(iris, apply(o4, 2, unlist))

## result
head(res4)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species     fit1     fit2
# setosa.1          5.1         3.5          1.4         0.2  setosa 1.476004 1.451029
# setosa.2          4.9         3.0          1.4         0.2  setosa 1.449120 1.431737
# setosa.3          4.7         3.2          1.3         0.2  setosa 1.426185 1.416492
# setosa.4          4.6         3.1          1.5         0.2  setosa 1.404040 1.398103
# setosa.5          5.0         3.6          1.4         0.2  setosa 1.462460 1.441295
# setosa.6          5.4         3.9          1.7         0.4  setosa 1.504990 1.559045

这篇关于如何按组/子集进行一次留出交叉验证?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆