Getting p-values from leave-one-out in R

Views: 105 | Posted: 2020/9/21 3:22:19 | Tags: r, bioinformatics, cross-validation


Question

I have a data frame of 96 observations (patients) and 1098 variables (genes). The response is binary (Y and N) and the predictors are numeric. I am trying to perform leave-one-out cross-validation (LOOCV), but I am not interested in the standard error. What I want are the p-values for each variable from each of the 95 logistic regression models created by LOOCV. These are my attempts so far:

# DF2: data frame with 96 observations and 1098 variables
DF2

fit <- list()

for (i in 1:96) {
  df <- DF2[-i, ]
  fit[[i]] <- glm(response ~ ., data = df, family = "binomial")
}
model_pvalues <- data.frame(model = character(), p_value = numeric())

This produces fit as a large list with 16 elements, each a list of 30 components: $coefficients, $residuals, $fitted.values, ...

Attempt 1:

for (i in length(fit)){ 
  model_pvalues <- rbind(model_pvalues, coef(summary(fit[[i]])))
}

This puts 95 rows (the intercept and 94 variables) and 4 columns into model_pvalues: Estimate, Std. Error, z value, Pr(>|z|). However, what I am really trying to get is the p-value for all 1097 variables, for each of the 95 models constructed by leave-one-out cross-validation.
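A note on the loop in Attempt 1: `for (i in length(fit))` iterates over the single number `length(fit)`, so the body runs only once, for the last model. That would be consistent with getting one model's coefficient table back. The minimal sketch below shows the difference from `seq_along()`; `fit_demo` is a hypothetical stand-in for the list of fitted models.

```r
## Hypothetical stand-in for the list of fitted models
fit_demo <- list("a", "b", "c")

## length(fit_demo) is the single number 3, so the body runs once, with i = 3
visited_wrong <- c()
for (i in length(fit_demo)) visited_wrong <- c(visited_wrong, i)

## seq_along() yields 1, 2, 3, so the body runs once per list element
visited_right <- c()
for (i in seq_along(fit_demo)) visited_right <- c(visited_right, i)

visited_wrong   # 3
visited_right   # 1 2 3
```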

Attempt 2:

for (i in length(fit)){ 
  model_pvalues <- rbind(model_pvalues, coef(summary(fit[[i]]))[4])
}

When I run this I get a single number for one variable (I am not sure where it comes from; I assume it is a beta).
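The single number in Attempt 2 is probably an estimate rather than a p-value: `coef(summary(...))` returns a matrix, and indexing a matrix with one number treats it as a vector in column-major order, so `[4]` picks the fourth entry of the first column. A minimal sketch, using a made-up table `tab` with placeholder values shaped like a `glm` coefficient table:

```r
## Hypothetical coefficient table shaped like coef(summary(glm_fit));
## the numbers are placeholders, not real estimates
tab <- matrix(1:20 / 100, nrow = 5,
              dimnames = list(c("(Intercept)", "X1", "X2", "X3", "X4"),
                              c("Estimate", "Std. Error", "z value", "Pr(>|z|)")))

tab[4]              # single index: 4th element of column 1, the "estimate" for X3
tab[, 4]            # 4th column: every p-value
tab[, "Pr(>|z|)"]   # the same column, selected by name (safer than a position)
```

Selecting the column by name, as the answer below does, avoids depending on the column order.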

Attempt 3:

for (i in 1:96){
  df <- DF2[-i,]
  fit[[i]] <- glm (response ~., data= df, family= "binomial")
  model_pvalues <- rbind(model_pvalues, coef(summary(fit[[i]])))
}

When I run this I get a data frame of 1520 observations of 4 variables: Estimate, Std. Error, z value, Pr(>|z|). The observations begin with (Intercept), followed by 82 variables. After that the pattern repeats with (Intercept1) and the same 82 variables, up to (Intercept15).

So my end goal is to create 95 models via LOOCV and to get the p-values for all 1097 variables used in all models. Any help would be very much appreciated!

Example data (the real data frame has 96 observations of 1098 variables):

  Response  X1  X2  X3  X4  X5  X6  X7  X8  X9  X10

P1  N       1   1   1   0   1   0   1   0   2    2
P2  N       2   1   1   0   2   2   1   2   2    2
P3  N       2   1   2   1   1   0   1   1   0    1
P4  Y       1   1   2   0   1   0   0   1   1    1
P5  N       2   2   1   1   1   0   0   0   1    1
P6  N       2   1   2   1   1   0   0   0   2    1
P7  Y       2   1   1   0   2   0   0   0   2    0
P8  Y       2   1   1   0   2   0   0   1   0    2
P9  N       1   1   1   0   2   0   0   0   1    0
P10 N       2   1   2   1   1   0   1   0   0    2

Recommended answer

For n observations (96 for your real data, 10 in the example data) and p variables (1098 for your real data, 10 in the example data), the code below should extract a p-row by n-column matrix of p-values. I feel obliged to warn you that fitting an n << p case (very few observations relative to the number of parameters) is likely to have extremely poor statistical properties, and may even be impossible, unless you use a technique like penalized regression. This is also probably why so many of your parameters are missing from the estimates (i.e. you're only getting 94 of a possible 1097 variables): a model fit to 95 observations can estimate at most 95 coefficients, and since your expression patterns are simple (only 0, 1, or 2), a large number of the parameters are collinear and can't be jointly estimated (you should have seen a lot of NAs in your original model fit, too).
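To make the n << p point concrete, here is a small sketch on simulated data (not the poster's data; `sim`, `fit_sim`, the seed, and the sizes are all made up for illustration). With 10 observations and 15 predictors plus an intercept, at most 10 coefficients can be estimated, so at least 6 come back as NA.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 10       # observations
p <- 15       # predictors (more than n)

## Simulated predictors coded 0/1/2, plus a shuffled balanced binary response
sim <- as.data.frame(matrix(sample(0:2, n * p, replace = TRUE), nrow = n))
sim$resp <- factor(sample(rep(c("N", "Y"), each = n / 2)))

## Convergence and "fitted probabilities 0 or 1" warnings are expected here,
## so they are suppressed to keep the output readable
fit_sim <- suppressWarnings(glm(resp ~ ., data = sim, family = "binomial"))

n_coef <- length(coef(fit_sim))       # p + 1, counting the intercept
n_na   <- sum(is.na(coef(fit_sim)))   # coefficients that could not be estimated
c(coefficients = n_coef, not_estimated = n_na)
```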

Get the example data:

DF2 <- read.table(row.names=1,header=TRUE,text="
Resp. X1  X2  X3  X4  X5  X6  X7  X8  X9  X10
P1  N   1   1   1   0   1   0   1   0   2   2
P2  N   2   1   1   0   2   2   1   2   2   2
P3  N   2   1   2   1   1   0   1   1   0   1
P4  Y   1   1   2   0   1   0   0   1   1   1
P5  N   2   2   1   1   1   0   0   0   1   1
P6  N   2   1   2   1   1   0   0   0   2   1
P7  Y   2   1   1   0   2   0   0   0   2   0
P8  Y   2   1   1   0   2   0   0   1   0   2
P9  N   1   1   1   0   2   0   0   0   1   0
P10 N   2   1   2   1   1   0   1   0   0   2")

Fit the models:

n <- nrow(DF2)
fit <- vector(mode="list",n) ## best to pre-allocate objects
for (i in 1:n) {
  df <- DF2[-i,]
  fit[[i]] <- glm (Resp. ~., data= df, family= "binomial")
}

In this case we have to be a little careful when extracting the p-values because, due to collinearity, some of them are missing. R leaves an NA in the coefficient vector (coef()) for non-estimated parameters, but it does not fill in the corresponding rows of the coefficient table in the summary; those rows are simply absent.
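The mismatch between `coef()` and the summary table is easy to reproduce. The sketch below uses the built-in `mtcars` data with a deliberately redundant column; `wt2` and `fit_alias` are hypothetical names introduced only for this illustration.

```r
## mtcars ships with R; wt2 duplicates wt exactly, so it cannot be estimated
dat <- mtcars[, c("am", "hp", "wt")]
dat$wt2 <- 2 * dat$wt

fit_alias <- glm(am ~ hp + wt + wt2, data = dat, family = "binomial")

coef(fit_alias)                      # wt2 is present, with value NA
rownames(coef(summary(fit_alias)))   # wt2 has no row at all

length(coef(fit_alias))              # 4 entries
nrow(coef(summary(fit_alias)))       # 3 rows
```

This is why the two objects have to be matched by name rather than by position.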

tmpf <- function(x) {
    ## extract coef vector - has NA values for collinear terms
    ## [-1] is to drop the intercept
    r1 <- coef(x)[-1]
    ## fill in values from p-value vector; leave out intercept with -1,
    r2 <- coef(summary(x))[-1,"Pr(>|z|)"]
    r1[names(r2)] <- r2
    return(r1)
}
pvals <- sapply(fit,tmpf)

Of course, for the toy example, all of the p-values are essentially equal to 1:

## round(pvals,4)
##       [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
## X1  0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9999 0.9998 0.9999 0.9998
## X2  0.9999 0.9999 0.9999 0.9999     NA 0.9999 0.9999 0.9999 0.9999 0.9999
## X3  0.9999 0.9999 0.9999 0.9999 0.9999 0.9998 0.9999 0.9999 0.9999 0.9999
## X4  0.9998 0.9998 0.9998     NA 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
## X5      NA 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000     NA 1.0000
## X6  0.9999     NA 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
## X7  1.0000 1.0000 1.0000 1.0000 1.0000     NA 1.0000 1.0000 1.0000 1.0000
## X8  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
## X9  1.0000 1.0000     NA 1.0000 1.0000 1.0000     NA     NA 1.0000     NA
## X10     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
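Once the matrix exists, it may help to label each column by the left-out observation and to summarize each variable across folds. The following is only a sketch on simulated data: `sim`, `get_pvals`, `fold_summary`, the seed, and the effect size are assumptions made for illustration, and `get_pvals` is a variant of the extraction function above rather than a replacement for it.

```r
set.seed(42)   # arbitrary seed
n <- 40
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$resp <- rbinom(n, 1, plogis(0.8 * sim$x1))   # simulated 0/1 response

## Start from coef() so that terms dropped for collinearity stay NA,
## then fill in p-values by name from the summary table
get_pvals <- function(model) {
  out <- coef(model)[-1]                    # drop the intercept
  out[] <- NA_real_
  tab <- coef(summary(model))
  keep <- setdiff(rownames(tab), "(Intercept)")
  out[keep] <- tab[keep, "Pr(>|z|)"]
  out
}

## One model per left-out row; result has one row per variable, one column per fold
pvals <- sapply(seq_len(n), function(i) {
  get_pvals(glm(resp ~ ., data = sim[-i, ], family = "binomial"))
})
colnames(pvals) <- paste0("drop_", seq_len(n))

## Per-variable summary across folds
fold_summary <- data.frame(
  n_missing = rowSums(is.na(pvals)),
  median_p  = apply(pvals, 1, median, na.rm = TRUE),
  max_p     = apply(pvals, 1, max, na.rm = TRUE)
)
fold_summary
```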
