循环以实现“留一法"观察并运行glm,一次运行一个变量 [英] Loop to implement Leave-One-Out observation and run glm, one variable at a time
问题描述
我有一个包含96个观测值和1106个变量的数据框.
I have a data frame with 96 observations and 1106 variables.
-
我想通过一次不填写,一次对观察结果进行逻辑回归. (因此,对于第一组观察值,将删除第一个观察值的总数为95,对于第二组观察值,将除去第二个观察值的总数为95,依此类推,因此有95套观察值,每组都有一个观察遗漏了.)
I would like to run logistic regression on the observations by leaving one out, one at a time. (So for the first set of observations there would be 95 total with the first observation removed, the second set of observations there would be 95 total with the second observation removed, and so forth so that there are 95 sets of observations that each have one observation left out.)
此外,我想一次只对一个变量运行每组这些观察值.在对一个变量进行95个观测值的回归后,我想提取p值(不包含截距p值).
In addition, I would like to run each set of these observations on only one variable at a time. After running the regression for 95 observations on one variable, I would like to extract the p-value (leaving out the intercept p-value).
我已经能够一次手动完成所有这些操作.但是,执行此操作96次非常繁琐,而且我确信必须有一种方法可以使用一个或多个循环将其自动化.
I have been able to do all of this manually, one at a time. However it is very tedious to do this 96 times and I'm sure there must be a way to automate this with a loop or multiple loops.
这是我如何手动进行10次观察的演示.
Here is a demonstration of how I have been doing this manually for 10 observations.
## Create 10 data frames by removing one observation from each ##
di.1 <- mainDF [-1,]
di.2 <- mainDF [-2,]
di.3 <- mainDF [-3,]
di.4 <- mainDF [-4,]
di.5 <- mainDF [-5,]
di.6 <- mainDF [-6,]
di.7 <- mainDF [-7,]
di.8 <- mainDF [-8,]
di.9 <- mainDF [-9,]
di.10 <- mainDF [-10,]
## Create data frames to put each p-value result in ##
dt.1 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.2 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.3 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.4 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.5 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.6 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.7 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.8 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.9 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
dt.10 <- data.frame(intercept=numeric(), gene=numeric(), stringsAsFactors=FALSE)
## Run logistic regression on each data frame with one one obs. left out ##
## GLM run on one variable at a time##
## Extract p-values and put in separate dfs ##
for (i in 2:1106)
{
formulas <- glm(response ~ di.1[,i], data=di.1, family= "binomial")
dt.1[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
formulas <- glm(response ~ di.2[,i], data=di.2, family= "binomial")
dt.2[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
formulas <- glm(response ~ di.3[,i], data=di.3, family= "binomial")
dt.3[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
formulas <- glm(response ~ di.4[,i], data=di.4, family= "binomial")
dt.4[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
formulas <- glm(response ~ di.5[,i], data=di.5, family= "binomial")
dt.5[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
formulas <- glm(response ~ di.6[,i], data=di.6, family= "binomial")
dt.6[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
formulas <- glm(response ~ di.7[,i], data=di.7, family= "binomial")
dt.7[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
formulas <- glm(response ~ di.8[,i], data=di.8, family= "binomial")
dt.8[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
formulas <- glm(response ~ di.9[,i], data=di.9, family= "binomial")
dt.9[i,] <- coef(summary(formulas))[,4]
}
for (i in 2:1106)
{
formulas <- glm(response ~ di.10[,i], data=di.10, family= "binomial")
dt.10[i,] <- coef(summary(formulas))[,4]
}
## Remove intercept p-values ##
dt.1<- dt.1[-c(1)]
dt.2<- dt.2[-c(1)]
dt.3<- dt.3[-c(1)]
dt.4<- dt.4[-c(1)]
dt.5<- dt.5[-c(1)]
dt.6<- dt.6[-c(1)]
dt.7<- dt.7[-c(1)]
dt.8<- dt.8[-c(1)]
dt.9<- dt.9[-c(1)]
dt.10<- dt.10[-c(1)]
## Export data frames, then manually copy and paste them into one CSV ##
write.csv(dt.1, file = "MyData.csv")
write.csv(dt.2, file = "MyData2.csv")
write.csv(dt.3, file = "MyData3.csv")
write.csv(dt.4, file = "MyData4.csv")
write.csv(dt.5, file = "MyData5.csv")
write.csv(dt.6, file = "MyData6.csv")
write.csv(dt.7, file = "MyData7.csv")
write.csv(dt.8, file = "MyData8.csv")
write.csv(dt.9, file = "MyData9.csv")
write.csv(dt.10, file = "MyData10.csv")
我想知道如何才能完成所有这些工作,而不必一次完成一次观察.
I would like to know how I could do all of this work without having to go through each observation one at a time.
这是我正在使用的数据块:
Here is a chunk of the data that I am using:
Response X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
P1 N 1 1 1 0 1 0 1 0 2 2
P2 N 2 1 1 0 2 2 1 2 2 2
P3 N 2 1 2 1 1 0 1 1 0 1
P4 Y 1 1 2 0 1 0 0 1 1 1
P5 N 2 2 1 1 1 0 0 0 1 1
P6 N 2 1 2 1 1 0 0 0 2 1
P7 Y 2 1 1 0 2 0 0 0 2 0
P8 Y 2 1 1 0 2 0 0 1 0 2
P9 N 1 1 1 0 2 0 0 0 1 0
P10 N 2 1 2 1 1 0 1 0 0 2
非常感谢您的光临!
推荐答案
正如我在前面的评论中所说,我不会使用glm
和summary.glm
,因为这对于您的任务来说太慢了将适合96 * 1106
GLM.我将使用glm.fit
,并自己计算回归系数的p值.下面的功能f
可以做到这一点.它以一维向量x
作为协变量(不允许使用NA
)和另一个一维向量y
作为响应(不允许使用NA
).由于完成了逻辑回归,因此y
必须是两个级别的因子(或0-1二进制值).
As I said earlier in my comment, I won't use glm
and summary.glm
as that will be too slow for your task, given that you are going to fit 96 * 1106
GLM. I will use glm.fit
, and compute p-values for regression coefficients myselft. The function f
below does this. It takes a 1D-vector x
as covariate (no NA
allowed) and another 1D-vector y
as response (no NA
allowed). Since Logistic regression is done, it is required that y
is a factor of two levels (or 0-1 binary values).
f <- function (x, y) {
## call `glm.fit`
fit <- glm.fit(cbind(1,x), y, family = binomial())
## estimated regression coefficients
beta <- unname(fit$coefficients)
## since there are only two coefficients, I don't bother using `chol2inv`
## then extract square root of diagonals for standard errors
se <- sqrt(diag(chol2inv(fit$qr$qr, size = fit$qr$rank)))
## deal with possible rank-deficient case
if (length(se) < 2L) se <- c(se, NA_real_)
## z-score
z <- beta / se
## p-value (0.05 significance level)
2 * pnorm(-abs(z))
}
如果您不相信它的正确性,我们可以对此功能进行测试.以您的示例数据框dat
为例,我们做Response ~ X1
:
We can have a test on this function, if you don't trust it for correctness. Take your example data frame dat
as an example and we do Response ~ X1
:
dat <-
structure(list(Response = structure(c(1L, 1L, 1L, 2L, 1L, 1L,
2L, 2L, 1L, 1L), .Label = c("N", "Y"), class = "factor"), X1 = c(1L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), X2 = c(1L, 1L, 1L, 1L, 2L,
1L, 1L, 1L, 1L, 1L), X3 = c(1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
2L), X4 = c(0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L), X5 = c(1L,
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), X6 = c(0L, 2L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), X7 = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
1L), X8 = c(0L, 2L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L), X9 = c(2L,
2L, 0L, 1L, 1L, 2L, 2L, 0L, 1L, 0L), X10 = c(2L, 2L, 1L, 1L,
1L, 1L, 0L, 2L, 0L, 2L)), .Names = c("Response", "X1", "X2",
"X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10"), row.names = c("P1",
"P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"), class = "data.frame")
## code response into factor
dat[[1]] <- factor(dat[[1]])
## call `f`
f(dat[[2]], dat[[1]])
# [1] 0.8559137 0.8804148
## call `glm` + `summary.glm`
coef(summary(glm(Response ~ X1, data = dat, family = binomial())))
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -0.4700036 2.588435 -0.1815783 0.8559137
#X1 -0.2231436 1.483239 -0.1504434 0.8804148
所以p值匹配!
我们现在需要另一个函数g
来组织您打算做的双重嵌套循环.外循环控制留一",而内循环由lapply
排列以遍历数据帧列.在外循环的每次迭代结束时,将所得的p值数据帧写入".csv"文件.
We now need another function g
to organize what you plan to do as a double-nested loop. The outer loop controls "leave-one-out", while the inner loop is arranged by lapply
to loop through data frame columns. In the end of each iteration of outer loop, the resulting data frame of p-values are written to a ".csv" file.
g <- function (dat) {
## convert response to factor (if it is not readily is)
y <- as.factor(dat[[1]])
## leave-one-out
for (i in 1:nrow(dat)) {
## covariates data frame
covariates <- dat[-i, -1]
## response vector
response <- y[-i]
## call `f` to get a data frame of p-values
result <- as.data.frame(lapply(covariates, f, y = response))
## write data frame to file
write.csv(result, file = paste0(i,".csv"), row.names = FALSE)
}
}
运行g(dat)
时,按预期获得了十个".csv"文件.
When I run g(dat)
, I get ten ".csv" files as expected.
后续行动:
我仍在掌握如何在R中执行循环,因此看到这一点非常有帮助.将其应用于我的数据时,是否可以将要使用的数据框的名称放在
dat
中?我是否需要在glm.fit
函数部分中指定数据帧?
I am still grasping how to do loops in R so seeing this is very helpful. In applying this to my data, would I put the name of the data frame I'd like to use in the
dat
? And do I need to specify the data frame in theglm.fit
function portion?
不. glm.fit
(和lm.fit
也是)没有公式界面.它只需要数值矩阵而不会丢失任何值,即可直接进行矩阵计算以获得估计值.这就是为什么它比glm
更快的原因.它不获取和摘要数据帧.您可以阅读?glm.fit
来了解需要什么参数.
No. glm.fit
(and lm.fit
, too) has no formula interface. It takes only numerical matrices without missing values for direct matrix computations to get estimation. This is exactly why it is faster than glm
. It does not take and digest a data frame. You can read ?glm.fit
to see what arguments it takes.
您的数据框dat
不必具有列名.如上所述,我们在任何地方都没有使用过公式接口.函数g
假定dat
的第一列是响应,而所有其他列都是自变量.另外,g
不会检查缺失值/NA
,因此您应确保dat
没有不完整的大小写.这些只是g
和f
的要求.
Your data frame dat
does not have to have column names. As said above, we have in nowhere used formula interface. The function g
assumes that the first column of dat
is response, while all other columns are independent variables. Also, g
does not check missing values / NA
, so you should ensure that dat
has no incomplete cases. These are only requirements for g
and f
.
在dat
中具有列名的唯一优点是,这些列名将作为标头写入导出的".csv"文件中,这可能会提高可读性.
The only advantage of having column names in dat
, is that those column names will be written as header in the exported ".csv" files, which may increase readability.
这篇关于循环以实现“留一法"观察并运行glm,一次运行一个变量的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!