在两个变量列表上进行迭代循环以在R中进行多元回归 [英] looping with iterations over two lists of variables for a multiple regression in R

查看:84
本文介绍了在两个变量列表上进行迭代循环以在R中进行多元回归的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想在R中编写一个循环,以使用一个因变量和两个自变量列表(所有连续变量)运行多个回归.该模型是可加的,循环应通过遍历两个变量列表来运行,以使其从第一个列表中获取第一列+从第二个列表中获取第一列,然后对两个列表中的第二列进行相同操作,依此类推.问题是我无法正确地遍历列表,相反,我的循环运行了比应有的模型更多的模型.

I want to write a loop in R to run multiple regressions with one dependent variables and two lists of independent variables (all continuous variables). The model is additive and the loop should run by iterating through the two lists of variables so that it takes the first column from the first list + the first column from the second list, then the same for the second column in the two lists etc. The problem is I can't get it to iterate through the lists properly, instead my loop runs more models than it should.

我在这里描述的数据帧只是我实际上必须运行3772次的一个子集(我正在研究RNA-seq转录物表达).

The dataframe I am describing here is just a subset I will actually have to run this 3772 times (I am working on RNA-seq transcript expression).

我的数据框称为干燥",包含22个变量(列)和87个观察值(行).第1列包含genotypeID,第2:11列包含一组要迭代的自变量,第12:21列包含第二组要迭代的自变量,第23列包含我的因变量FITNESS_DRY.结构是这样的:

My dataframe is called dry, and contains 22 variables (columns) and 87 observations (rows). Column 1 contains genotypeIDs, column 2:11 contains one set of independent variables to iterate through, column 12:21 contains a second set of independent variables to iterate through, and column 23 contains my dependent variable called FITNESS_DRY. This is what the structure looks like:

str(dry)
'data.frame':   87 obs. of  22 variables:
$ geneID     : Factor w/ 87 levels "e10","e101","e102",..: 12 15 17 24 25    30 35 36 38 39 ...
$ RDPI_T1    : num  1.671 -0.983 -0.776 -0.345 0.313 ...
$ RDPI_T2    : num  -0.976 -0.774 -0.532 -1.137 1.602 ...
$ RDPI_T3    : num  -0.197 -0.324 0.805 -0.701 -0.566 ...
$ RDPI_T4    : num  0.289 -0.92 1.117 -1.214 -0.447 ...
$ RDPI_T5    : num  -0.671 1.963 NA -1.024 -0.295 ...
$ RDPI_T6    : num  2.606 -1.116 -0.383 -0.893 0.119 ...
$ RDPI_T7    : num  -0.843 -0.229 -0.297 0.504 -0.712 ...
$ RDPI_T8    : num  -0.227 NA NA -0.816 -0.761 ...
$ RDPI_T9    : num  0.754 -1.304 1.867 -0.514 -1.377 ...
$ RDPI_T10   : num  1.1352 -0.1028 -0.69 2.0242 -0.0925 ...
$ DRY_T1     : num  0.6636 -0.64508 -0.24643 -1.43231 -0.00855 ...
$ DRY_T2     : num  1.008 0.823 -0.658 -0.148 0.272 ...
$ DRY_T3     : num  -0.518 -0.357 1.294 0.408 0.771 ...
$ DRY_T4     : num  0.0723 0.2834 0.5198 1.6527 0.4259 ...
$ DRY_T5     : num  0.1831 1.9984 NA 0.0923 0.1232 ...
$ DRY_T6     : num  -1.55 0.366 0.692 0.902 -0.993 ...
$ DRY_T7     : num  -2.483 -0.334 -1.077 -1.537 0.393 ...
$ DRY_T8     : num  0.396 NA NA -0.146 -0.468 ...
$ DRY_T9     : num  -0.694 0.353 2.384 0.665 0.937 ...
$ DRY_T10    : num  -1.24 -1.57 -1.36 -3.88 -1.4 ...
$ FITNESS_DRY: num  1.301 3.365 0.458 0.346 1.983 ...

目标是运行10个如下所示的多元回归:

The goal is to run 10 multiple regressions looking like this:

lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)

依次遍历两个列表的所有十列 就索引而言,这等效于以下内容

and so forth iterating through all ten columns for both lists This is equivalent to the following in terms of indexing

lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])
lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])

然后,我的循环应为每个模型计算汇总,并在输出对象中合并所有p值(lm摘要的第4列).

My loop should then calculate summaries for each model, and combine all the pvalues (4th column of the lm summary) in an output object.

我首先定义了变量列表

var_list<-list(
var1=dry[,12:21],
var2=dry[,2:11]
)

这是我尝试的无法正常工作的循环:

This is the loop I tried which doesn't work properly:

lm.test1<-name<-vector()
for (i in 12:length(var_list$var1)){
    for (j in 2:length(var_list$var2)){
lm.tmp<-lm(FITNESS_DRY~dry[,i]+dry[,j], na.action=na.omit, data=dry)
sum.tmp<-summary(lm.tmp)
lm.test1<-rbind(lm.test1,sum.tmp$coefficients[,4]) }
}

循环返回此错误消息:

Warning message:
In rbind(lm.test6, sum.tmp$coefficients[, 4]) :
number of columns of result is not a multiple of vector length (arg 2)

我可以调用对象"lm.test1",但是该对象有27行而不是我想要的10行,因此此处的迭代无法正常工作.有人可以帮忙吗?另外,如果我可以将每个变量列表的列名添加到摘要中,那将是很好的.我尝试将其用于每个变量列表,但没有成功:

I can call up the object "lm.test1", but that object has 27 lines instead of the 10 I want, so the iterations are not working properly here. Can anyone help with this please? Also, it would be great if I could add the names of my columns for each list of variables into the summary. I have tried using this for each variable list but without succes:

name<-append(name, as.character(colnames(var_list$var1))

有什么想法吗?预先感谢您的帮助!

Any ideas? Thanks in advance for any help!

UPDATE1:有关完整数据集的更多信息: 我的实际数据仍将包含第一个列"geneID",然后我有3772个列名称为DRY_T1 .... DRY_T3772,然后是另一个3772列名称为RDPI_T1 ... RDPI_T3772,最后是我的因变量"FITNESS_DRY".我仍然想像这样运行所有加法模型:

UPDATE1: More information on the full data set: My actual data will still contain a first colum "geneID", then I have 3772 columns names DRY_T1....DRY_T3772, then another 3772 columns names RDPI_T1...RDPI_T3772, and finally my dependent variable "FITNESS_DRY". I still want to run all additive models as such:

lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)
lm3772<-lm(FITNESS_DRY~DRY_T3772+RDPI_T3772)

我这样模拟了一个数据集:

I simulated a dataset as such:

set.seed(2)
dat3 = as.data.frame(replicate(7544, runif(20)))
names(dat3) = paste0(rep(c("DRY_T","RDPI_T"),each=3772), 1:3772)
dat3 = cbind(dat3, FITNESS_DRY=runif(20))

然后运行for循环:

models = list()
for(i in 1:3772) {
vars = names(dat3)[grepl(paste0(i,"$"), names(dat3))]
models2[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" 
+")),
                                 data = dat3)
}

这在数据模拟中效果很好,但是当我在完全相同的设置下对真实数据集进行尝试时却无法正常工作.循环可能在处理具有两个或多个数字的数字时遇到问题.我收到此错误消息:

This works fine on the data simulation, but when I try it on my real dataset that is set up exactly in the same way it doesn't work. The loop is probably having issues handling numbers with two or more digits. I get this error message:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
0 (non-NA) cases

更新2:确实,该模型在处理两位或多位数字方面存在问题.为了查看原始版本中出现了什么问题,我使用了以下方法: (我的数据集称为"dry2"):

UPDATE 2: Indeed the model had issues handling numbers with two or more digits. To see how things go wrong in the original version I used this: (my dataset is called "dry2"):

names(dry2)[grepl("2$", names(dry2))]

这将返回所有包含数字"2"的DRY_T和RDPI_T变量,而不是仅一对DRY_T和RDPI_T.

This returned all DRY_T and RDPI_T variables with numbers containing "2" instead of just one pair of DRY_T and RDPI_T.

要解决此问题,此新代码将起作用:

To fix the issue this new code works:

models = list()

for(i in 1:3772) {
vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]
models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" +   ")),
data = dry2)
}

推荐答案

有很多方法可以设置模型公式以进行迭代.这是一种方法,我们演示了如何使用for循环或purrr包中的map进行迭代.然后,使用broom包中的tidy来获取系数和p值.

There are a number of ways to set up the model formulas for iteration. Here's one approach, which we demonstrate using a for loop or map from the purrr package for the iteration. Then we use tidy from the broom package to get the coefficients and p-values.

library(tidyverse)
library(broom)

# Fake data
set.seed(2)
dat = as.data.frame(replicate(20, runif(20)))
names(dat) = paste0(rep(c("DRY_T","RDPI_T"),each=10), 0:9)
dat = cbind(dat, FITNESS_DRY=runif(20))

# Generate list of models

# Using for loop
models = list()

for(i in 0:9) {

  # Get the two column names to use for this iteration of the model
  vars = names(dat)[grepl(paste0(i,"$"), names(dat))]

  # Fit the model and add results to the output list
  models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
                                 data = dat)
}

# Same idea using purrr::map to iterate
models = map(0:9 %>% set_names(), 
             ~ {
               vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
               form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
               lm(form, data = dat)
             })

# Check first two models
models[1:2]
#> $`0`
#> 
#> Call:
#> lm(formula = form, data = dat)
#> 
#> Coefficients:
#> (Intercept)       DRY_T0      RDPI_T0  
#>      0.4543       0.3025      -0.1624  
#> 
#> 
#> $`1`
#> 
#> Call:
#> lm(formula = form, data = dat)
#> 
#> Coefficients:
#> (Intercept)       DRY_T1      RDPI_T1  
#>     0.64511     -0.33293      0.06698

# Get coefficients and p-values for each model in a single data frame
results = map_df(models, tidy, .id="run_number")

results
#> # A tibble: 30 x 6
#>    run_number term        estimate std.error statistic p.value
#>    <chr>      <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#>  1 0          (Intercept)   0.454      0.153     2.96  0.00872
#>  2 0          DRY_T0        0.303      0.197     1.53  0.143  
#>  3 0          RDPI_T0      -0.162      0.186    -0.873 0.395  
#>  4 1          (Intercept)   0.645      0.185     3.49  0.00279
#>  5 1          DRY_T1       -0.333      0.204    -1.63  0.122  
#>  6 1          RDPI_T1       0.0670     0.236     0.284 0.780  
#>  7 2          (Intercept)   0.290      0.147     1.97  0.0650 
#>  8 2          DRY_T2        0.270      0.176     1.53  0.144  
#>  9 2          RDPI_T2       0.180      0.185     0.972 0.345  
#> 10 3          (Intercept)   0.273      0.187     1.46  0.162  
#> # … with 20 more rows

reprex软件包(v0.2.1)于2019-06-28创建 sup>

Created on 2019-06-28 by the reprex package (v0.2.1)

如果不需要保存模型对象,则只需返回系数和p值的数据框:

If you don't need to save the model objects, you can just return the data frame of coefficients and p-values:

results = map_df(0:9 %>% set_names(), 
            ~ {
              vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
              form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
              tidy(lm(form, data = dat))
            }, .id="run_number")

更新:在回答您的评论时,如果将0:9的所有实例替换为1:10(对不起,没有注意到您的列后缀从1:10而不是0 :9),以及dat(我的虚假数据)和dry2(或您用于数据框的任何名称)的所有实例,只要列名是与您在问题中使用的相同.如果您使用不同的列名,则需要通过硬编码新名称或创建一个可以接受您正在使用的模型的任何列名的函数来修改代码.生成.

UPDATE: In answer to your comment, if you replace all instances of 0:9 with 1:10 (sorry, didn't notice that your column suffixes went from 1:10 rather than 0:9), and all instances of dat (my fake data) with dry2 (or whatever name you're using for your data frame), the code will run with your data, so long as the column names are the same as the ones you used in your question. If you're using different column names, you'll need to adapt the code, either by hard-coding the new names or by creating a function that can accept whatever column names you're using for the model(s) you're generating.

要解释代码的作用:首先,我们需要获取要在模型的每次迭代中使用的列的名称.例如,在for循环版本中:

To explain what the code is doing: First, we need to get the names of the columns we want to use in each iteration of the model. For example, in the for-loop version:

vars = names(dry2)[grepl(paste0(i,"$"), names(dry2))]

例如,当i=2时,解析为:

vars = names(dry2)[grepl("2$", names(dry2))]
vars

[1] "RDPI_T2" "DRY_T2"

这就是我们要用来生成回归公式的两列. "2$"是正则表达式(正则表达式是字符串匹配语言),它表示:匹配names(dry2)中以数字'2'结尾的值.

So those are the two columns we want to use to generate a regression formula. "2$" is a regular expression (regular expressions is a string matching language) that means: match values in names(dry2) that end with the number '2'.

要创建公式,请执行以下操作:

To create our formula we do:

paste(vars, collapse=" + ")

[1] "RDPI_T2 + DRY_T2"

form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
form

[1] "FITNESS_DRY ~  RDPI_T2 + DRY_T2"

现在我们有了我们在lm内部使用的回归公式.

And now we have our regression formula which we use inside lm.

每次迭代(使用formap,或在@RomanLuštrik的建议下,使用mapply)都会生成后续模型.

Each iteration (either with for or map or, in @RomanLuštrik's suggestion, mapply), generates the successive models.

更新2:正如我在评论中指出的那样,我意识到当最终数字为多于一位数字.因此,请改用此方法(对于map版本也应如此):

UPDATE 2: As I noted in the comment, I realized that the regular expression paste(i, "$") will fail (by matching more than one of each type of independent variable column) when the final number is more than one digit. So, try this instead (and similarly for the map version):

models = list()

for(i in 1:3772) {

  # Get the two column names to use for this iteration of the model
  vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]

  # Fit the model and add results to the output list
  models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
                                 data = dry2)
}

要查看原始版本中出现的问题,请运行例如names(dry2)[grepl("2$", names(dry2))]

To see how things go wrong in the original version, run, for example, names(dry2)[grepl("2$", names(dry2))]

这篇关于在两个变量列表上进行迭代循环以在R中进行多元回归的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆