for loops for regression over multiple variables & outputting a subset


Problem description


I have tried to apply this QA: "efficient looping logistic regression in R" to my own problem but I cannot quite make it work. I haven't tried to use apply, but I was told by a few people that a for loop is the best here (if someone believes otherwise, please feel free to explain!). I think this problem is pretty generalizable and not too esoteric for the forum.

This is what I want to achieve: I have a dataset with 3 predictor variables (gender, age, race) and a dependent variable (a proportion) for 86 genetic positions for several people. I want to run bivariate linear regressions for each position (so 86 linear regressions for 3 predictor variables). Then I want to output the results in some easily legible format; my idea is a matrix with rows=gender, age, and race, and columns=the 86 positions. There would be a p value for each row*column combination. Then I could call the p values<0.1 (or whatever threshold I want) to easily see which predictors are significantly associated with proportion at each position.

This is the code I have so far.

BB <- seq.csv[,6:91]   #the data frame containing the 86 positions
AA <- seq.csv[,2:4]    #the data frame containing the 3 predictor variables

linreg <- matrix(NA,3,86)  #make a results matrix and fill it with NA
for (i in 1:86)     #loop over each position variable
{
    for (j in 1:3)  #for each position variable, loop over each predictor
    {
        linreg[i,j] <- lm(BB[,i]~AA[,j])  #bivariate linear regression
    }
}

No matter how I change this (for example, simplifying it to loop over the positions for only one predictor), I still get an error that my matrices are not the same length (number of items to replace is not a multiple of replacement length). In fact, length(linreg)=258 (3*86), length(BB)=86 and length(AA)=3. I know the latter two are data frames, not matrices... but if I convert them to matrices I get an invalid type error (invalid type (list) for variable 'BB[, i]'). I do not know how to resolve this error because I just don't understand R well enough... I've consulted the books Applied Statistical Genetics with R and Art of R Programming to no avail, and I've been Google searching all day. And I haven't even gotten to the coding for outputting the results...

I'd appreciate any debugging tips or some suggestions on a better way to code this! Thank you all in advance.

Solution

Really hard to give a definitive answer without knowing the structure of your data beforehand, but this might work. I'm assuming that your two data frames have the same number of rows (observations):

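# AA and BB are already subsets of seq.csv, so bind them directly:
# columns 1:3 are the predictors, columns 4:89 are the 86 positions
df <- cbind( AA , BB )
mods <- apply( as.data.frame( df[ , 4:89 ] ) , 2 , FUN = function(x){ lm( x ~ df[,1] + df[,2] + df[,3] ) } )

# The rows of this matrix will correspond to the intercept, gender, age, race, and the columns are the results for each of your 86 genetic positions
# (this layout assumes each predictor contributes a single coefficient, e.g. race is numeric or has two levels)
pvals <- sapply( mods , function(x){ summary(x)$coefficients[,4] } )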

As to whether or not that is the right thing to do I will trust to your judgement as a genetic epidemiologist!
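
If you would rather keep the for loop and the one-predictor-at-a-time regressions described in the question, the sketch below may help. The original loop fails for two reasons: lm() returns a whole model object (a list), which cannot be stored in a single matrix cell, and linreg[i,j] uses the position index i as the row index even though the matrix has only 3 rows. The sketch stores one p-value per fit and swaps the indices. The data are simulated stand-ins for seq.csv (the real file is not shown, so the column names and values are assumptions), and taking the p-value from the overall F-test via anova() is my choice so that a predictor with more than two levels still yields a single number.

```r
# Simulated stand-in for seq.csv (hypothetical data; replace with your own AA and BB)
set.seed(1)
n  <- 40
AA <- data.frame(
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  age    = round(runif(n, 20, 70)),
  race   = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
BB <- as.data.frame(matrix(runif(n * 86), nrow = n))
names(BB) <- paste0("pos", 1:86)

# One row per predictor, one column per position
pvals <- matrix(NA_real_, nrow = ncol(AA), ncol = ncol(BB),
                dimnames = list(names(AA), names(BB)))

for (i in seq_len(ncol(BB))) {        # positions index the columns
  for (j in seq_len(ncol(AA))) {      # predictors index the rows
    fit <- lm(BB[[i]] ~ AA[[j]])      # one predictor, one position
    # Overall F-test p-value for the predictor: a single number per fit
    pvals[j, i] <- anova(fit)[1, "Pr(>F)"]
  }
}

# Row/column indices of the predictor-position pairs below the threshold
hits <- which(pvals < 0.1, arr.ind = TRUE)
```

Here pvals is the 3 x 86 matrix described in the question, and hits lists which predictor (row) and position (col) fall under the 0.1 cut-off. With 258 tests, some adjustment for multiple comparisons (for example p.adjust()) is probably worth considering before reading much into individual hits.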
