循环以实现“留一法"观察并运行glm,一次运行一个变量 [英] Loop to implement Leave-One-Out observation and run glm, one variable at a time

查看:68
本文介绍了循环以实现“留一法"观察并运行glm,一次运行一个变量的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含96个观测值和1106个变量的数据框.

I have a data frame with 96 observations and 1106 variables.

  • 我想通过一次不填写,一次对观察结果进行逻辑回归. (因此,对于第一组观察值,将删除第一个观察值的总数为95,对于第二组观察值,将除去第二个观察值的总数为95,依此类推,因此有95套观察值,每组都有一个观察遗漏了.)

  • I would like to run logistic regression on the observations by leaving one out, one at a time. (So for the first set of observations there would be 95 total with the first observation removed, the second set of observations there would be 95 total with the second observation removed, and so forth so that there are 95 sets of observations that each have one observation left out.)

此外,我想一次只对一个变量运行每组这些观察值.在对一个变量进行95个观测值的回归后,我想提取p值(不包含截距p值).

In addition, I would like to run each set of these observations on only one variable at a time. After running the regression for 95 observations on one variable, I would like to extract the p-value (leaving out the intercept p-value).

我已经能够一次手动完成所有这些操作.但是,执行此操作96次非常繁琐,而且我确信必须有一种方法可以使用一个或多个循环将其自动化.

I have been able to do all of this manually, one at a time. However it is very tedious to do this 96 times and I'm sure there must be a way to automate this with a loop or multiple loops.

这是我如何手动进行10次观察的演示.

Here is a demonstration of how I have been doing this manually for 10 observations.

## Create 10 data frames by removing one observation from each ##
di.1 <- mainDF [-1,]
di.2 <- mainDF [-2,]
di.3 <- mainDF [-3,]
di.4 <- mainDF [-4,]
di.5 <- mainDF [-5,]
di.6 <- mainDF [-6,]
di.7 <- mainDF [-7,]
di.8 <- mainDF [-8,]
di.9 <- mainDF [-9,]
di.10 <- mainDF [-10,]

## Create data frames to put each p-value result in ## 
dt.1 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.2 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.3 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.4 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.5 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.6 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.7 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.8 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.9 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.10 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)

## Run logistic regression on each data frame with one one obs. left out ##
## GLM run on one variable at a time##
## Extract p-values and put in separate dfs ##

for (i in 2:1106)
{
  formulas <- glm(response ~ di.1[,i], data=di.1, family= "binomial")
  dt.1[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
  formulas <- glm(response ~ di.2[,i], data=di.2, family= "binomial")
  dt.2[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
  formulas <- glm(response ~ di.3[,i], data=di.3, family= "binomial")
  dt.3[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
  formulas <- glm(response ~ di.4[,i], data=di.4, family= "binomial")
  dt.4[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
  formulas <- glm(response ~ di.5[,i], data=di.5, family= "binomial")
  dt.5[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
  formulas <- glm(response ~ di.6[,i], data=di.6, family= "binomial")
  dt.6[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
  formulas <- glm(response ~ di.7[,i], data=di.7, family= "binomial")
  dt.7[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
  formulas <- glm(response ~ di.8[,i], data=di.8, family= "binomial")
  dt.8[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
  formulas <- glm(response ~ di.9[,i], data=di.9, family= "binomial")
  dt.9[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
  formulas <- glm(response ~ di.10[,i], data=di.10, family= "binomial")
  dt.10[i,] <- coef(summary(formulas))[,4]
}

## Remove intercept p-values ##
dt.1<- dt.1[-c(1)]
dt.2<- dt.2[-c(1)]
dt.3<- dt.3[-c(1)]
dt.4<- dt.4[-c(1)]
dt.5<- dt.5[-c(1)]
dt.6<- dt.6[-c(1)]
dt.7<- dt.7[-c(1)]
dt.8<- dt.8[-c(1)]
dt.9<- dt.9[-c(1)]
dt.10<- dt.10[-c(1)]

## Export data frames, then manually copy and paste them into one CSV ##
write.csv(dt.1, file = "MyData.csv")
write.csv(dt.2, file = "MyData2.csv")
write.csv(dt.3, file = "MyData3.csv")
write.csv(dt.4, file = "MyData4.csv")
write.csv(dt.5, file = "MyData5.csv")
write.csv(dt.6, file = "MyData6.csv")
write.csv(dt.7, file = "MyData7.csv")
write.csv(dt.8, file = "MyData8.csv")
write.csv(dt.9, file = "MyData9.csv")
write.csv(dt.10, file = "MyData10.csv")

我想知道如何才能完成所有这些工作,而不必一次完成一次观察.

I would like to know how I could do all of this work without having to go through each observation one at a time.

这是我正在使用的数据块:

Here is a chunk of the data that I am using:

  Response  X1  X2  X3  X4  X5  X6  X7  X8  X9  X10

P1  N       1   1   1   0   1   0   1   0   2    2
P2  N       2   1   1   0   2   2   1   2   2    2
P3  N       2   1   2   1   1   0   1   1   0    1
P4  Y       1   1   2   0   1   0   0   1   1    1
P5  N       2   2   1   1   1   0   0   0   1    1
P6  N       2   1   2   1   1   0   0   0   2    1
P7  Y       2   1   1   0   2   0   0   0   2    0
P8  Y       2   1   1   0   2   0   0   1   0    2
P9  N       1   1   1   0   2   0   0   0   1    0
P10 N       2   1   2   1   1   0   1   0   0    2

非常感谢您的光临!

推荐答案

正如我在前面的评论中所说,我不会使用glmsummary.glm,因为这对于您的任务来说太慢了将适合96 * 1106 GLM.我将使用glm.fit,并自己计算回归系数的p值.下面的功能f可以做到这一点.它以一维向量x作为协变量(不允许使用NA)和另一个一维向量y作为响应(不允许使用NA).由于完成了逻辑回归,因此y必须是两个级别的因子(或0-1二进制值).

As I said earlier in my comment, I won't use glm and summary.glm as that will be too slow for your task, given that you are going to fit 96 * 1106 GLM. I will use glm.fit, and compute p-values for regression coefficients myselft. The function f below does this. It takes a 1D-vector x as covariate (no NA allowed) and another 1D-vector y as response (no NA allowed). Since Logistic regression is done, it is required that y is a factor of two levels (or 0-1 binary values).

f <- function (x, y) {
  ## call `glm.fit`
  fit <- glm.fit(cbind(1,x), y, family = binomial())
  ## estimated regression coefficients
  beta <- unname(fit$coefficients)
  ## since there are only two coefficients, I don't bother using `chol2inv`
  ## then extract square root of diagonals for standard errors
  se <- sqrt(diag(chol2inv(fit$qr$qr, size = fit$qr$rank)))
  ## deal with possible rank-deficient case
  if (length(se) < 2L) se <- c(se, NA_real_)
  ## z-score
  z <- beta / se
  ## p-value (0.05 significance level)
  2 * pnorm(-abs(z))
  }

如果您不相信它的正确性,我们可以对此功能进行测试.以您的示例数据框dat为例,我们做Response ~ X1:

We can have a test on this function, if you don't trust it for correctness. Take your example data frame dat as an example and we do Response ~ X1:

dat <- 
structure(list(Response = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 
2L, 2L, 1L, 1L), .Label = c("N", "Y"), class = "factor"), X1 = c(1L, 
2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), X2 = c(1L, 1L, 1L, 1L, 2L, 
1L, 1L, 1L, 1L, 1L), X3 = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 
2L), X4 = c(0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L), X5 = c(1L, 
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), X6 = c(0L, 2L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L), X7 = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L), X8 = c(0L, 2L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L), X9 = c(2L, 
2L, 0L, 1L, 1L, 2L, 2L, 0L, 1L, 0L), X10 = c(2L, 2L, 1L, 1L, 
1L, 1L, 0L, 2L, 0L, 2L)), .Names = c("Response", "X1", "X2", 
"X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10"), row.names = c("P1", 
"P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"), class = "data.frame")

## code response into factor
dat[[1]] <- factor(dat[[1]])

## call `f`
f(dat[[2]], dat[[1]])
# [1] 0.8559137 0.8804148

## call `glm` + `summary.glm`
coef(summary(glm(Response ~ X1, data = dat, family = binomial())))
#              Estimate Std. Error    z value  Pr(>|z|)
#(Intercept) -0.4700036   2.588435 -0.1815783 0.8559137
#X1          -0.2231436   1.483239 -0.1504434 0.8804148

所以p值匹配!

我们现在需要另一个函数g来组织您打算做的双重嵌套循环.外循环控制留一",而内循环由lapply排列以遍历数据帧列.在外循环的每次迭代结束时,将所得的p值数据帧写入".csv"文件.

We now need another function g to organize what you plan to do as a double-nested loop. The outer loop controls "leave-one-out", while the inner loop is arranged by lapply to loop through data frame columns. In the end of each iteration of outer loop, the resulting data frame of p-values are written to a ".csv" file.

g <- function (dat) {
  ## convert response to factor (if it is not readily is)
  y <- as.factor(dat[[1]])
  ## leave-one-out
  for (i in 1:nrow(dat)) {
    ## covariates data frame
    covariates <- dat[-i, -1]
    ## response vector
    response <- y[-i]
    ## call `f` to get a data frame of p-values
    result <- as.data.frame(lapply(covariates, f, y = response))
    ## write data frame to file
    write.csv(result, file = paste0(i,".csv"), row.names = FALSE)
    }
  }

运行g(dat)时,按预期获得了十个".csv"文件.

When I run g(dat), I get ten ".csv" files as expected.

后续行动:

我仍在掌握如何在R中执行循环,因此看到这一点非常有帮助.将其应用于我的数据时,是否可以将要使用的数据框的名称放在dat中?我是否需要在glm.fit函数部分中指定数据帧?

I am still grasping how to do loops in R so seeing this is very helpful. In applying this to my data, would I put the name of the data frame I'd like to use in the dat? And do I need to specify the data frame in the glm.fit function portion?

不. glm.fit(和lm.fit也是)没有公式界面.它只需要数值矩阵而不会丢失任何值,即可直接进行矩阵计算以获得估计值.这就是为什么它比glm更快的原因.它不获取和摘要数据帧.您可以阅读?glm.fit来了解需要什么参数.

No. glm.fit (and lm.fit, too) has no formula interface. It takes only numerical matrices without missing values for direct matrix computations to get estimation. This is exactly why it is faster than glm. It does not take and digest a data frame. You can read ?glm.fit to see what arguments it takes.

您的数据框dat不必具有列名.如上所述,我们在任何地方都没有使用过公式接口.函数g假定dat的第一列是响应,而所有其他列都是自变量.另外,g不会检查缺失值/NA,因此您应确保dat没有不完整的大小写.这些只是gf的要求.

Your data frame dat does not have to have column names. As said above, we have in nowhere used formula interface. The function g assumes that the first column of dat is response, while all other columns are independent variables. Also, g does not check missing values / NA, so you should ensure that dat has no incomplete cases. These are only requirements for g and f.

dat中具有列名的唯一优点是,这些列名将作为标头写入导出的".csv"文件中,这可能会提高可读性.

The only advantage of having column names in dat, is that those column names will be written as header in the exported ".csv" files, which may increase readability.

这篇关于循环以实现“留一法"观察并运行glm,一次运行一个变量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆