data.table替代嵌套for-loop? [英] data.table alternatives to nested for-loop?

查看:194
本文介绍了data.table替代嵌套for-loop?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个矩阵:diff.exp.protein.expr和 lncRNA.expr ,每个矩阵都有64列,但行数不同。

 > dim(diff.exp.protein.expr)
[1] 14000 64

> dim(lncRNA.expr)
[1] 7600 64



在两个单独的线性模型中使用这些矩阵作为输入,其中我将diff.exp.protein.expr的每一行与 lncRNA.expr (= 2 * 106)的所有行进行比较万次测试)。最后,我想比较两个线性模型是否在统计上不同使用anova,我已经写了一个函数,如下:

  lm.anova < -  function(diff.exp.protein.expr,lncRNA.expr,colData)
{
tempdff< - data.frame()#create一个空数据框
for(i in 1:nrow(diff.exp.protein.expr))#traverse通过第一个矩阵
{
for(j in 1:nrow(lncRNA.expr))#
{
#model 1
lm1< - lm(diff.exp.protein.expr [i,]〜colData $ gender + colData $ age + colData $ treating)
#model 2(在末尾有一个额外的交互项)
lm2 < - lm(diff.exp.protein.expr [i,]〜colData $ gender + colData $ age + colData $ treatment + lncRNA.expr [j,] + colData $ treatment * lncRNA.expr [j,])
#get模型2的摘要
lm.model< - summary(lm2)
#矩阵1的第i行和矩阵2的第j行
#从anova模型获取pval
#get估计和std。模型2的最后两个项的错误
#将这些值作为一行添加到tempdff
tempdff< - rbind(tempdff,data.frame(rownames(diff.exp.protein.expr)[i] ,rwnames(lncRNA.expr)[j],
anova(lm1,lm2)$Pr(> F)[2]),lm.model $ coefficients [4,1:2] model $ coefficients [6,1:2])
}
}
return(tempdff)#return results
}

lm.anova.res < - lm.anova(diff.exp.protein.expr,lncRNA.expr,colData)#call function

这是进入函数的数据如何:

 > head(diff.exp .protein.expr)[,1:5] #log转换值

C00060 C00079 C00135 C00150 C00154
ENSG00000000005.5 5.187977 6.323024 6.022986 5.376513 4.810042
ENSG00000000460.12 7.071433 7.302448 7.499133 7.441582 7.439453
ENSG00000000938.8 8.713285 8.584996 8.982816 9.787420 8.823927
ENSG00000001497.12 9.569952 9.658418 9.392670 9.394250 9.087214
ENSG00000001617.7 9.215362 9.417367 8.949943 8.172713 9.198314
ENSG00000001626.10 6.363638 6.038328 6.698733 5.202132 5.336239

> head(lncRNA.expr)[,1:5] #log转换值
C00060 C00079 C00135 C00150 C00154
ENSG00000100181.17 7.477983 7.776463 7.950514 8.098205 7.278615
ENSG00000115934.11 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000122043.6 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000124915.6 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000125514.5 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000129816 .5 4.104380 4.104380 4.104380 4.104380 4.104380

> head(colData)[,1:5]#两个输入矩阵的每列的样本信息
sample_name status age gender site treatment
C00060 NF 48女士克利夫兰案例
C00079 NF 26女士克利夫兰案例
C00135 NF 55女士克利夫兰案例
C00150 NF 61男士克利夫兰ctrl
C00154 NF 50男士克利夫兰案例
C00176 NF 60女士Cleveland ctrl

我写了这段代码, )。现在,我已经确定了,除了我不知道如何使我的代码有效,因为在这种情况下使用一个for循环,其中14000 * 7600测试必须执行没有任何意义。它将需要永远运行。我想知道什么是其他的选择,通过它我真的可以加速这段代码。



UPDATE 1 :我已经更新了我的线性模型。现在,anova(lm1,lm2)Pr(> F)不给出与model2中的交互项的值相同的值。



UPDATE 2



感谢,



首先,调用 data.frame(...) )是非常昂贵的,所以在每次迭代调用它会大大减慢的事情。另一方面,列表是非常高效的。



其次,可能只是运行2 * 1亿6千万的回归占用大部分时间。



我会倾向于这样尝试:

  ##未测试。 .. 
df1 < - t(diff.exp.protein.expr)
df2 < - t(lncRNA.expr)

df < - cbind(df1, df2,colData)
names< - expand.grid(x = colnames(df1),y = colnames(df2),stringsAsFactors = FALSE)
get.anova& b $ b form.1< - as.formula(paste0(n [1],〜gender + age + treatment +,n [2]))
form.2 < - as.formula (n [1],〜性别+年龄+治疗+,n [2],+治疗:,n [2]))
lm1 <
lm2 <-lm(form.2,data = df)
coef < - summary(lm2)$ coefficients
list(n [1],n [2],anova $ cob [5,2],coef [6,1],coef [6,2])
}
result< - do.call(rbind,apply(names,1,get.anova))
result< - data.frame(result)
colnames(result) c(蛋白,lncRNA,p.value,est.1,se.1,est.2,se.2)

这并未测试,因为您提供的数据集不够大: 0 df为错误。因此,在系数表中没有行6。



EDIT (回应OP的评论和提供资料)。



使用注释中提供的数据集,我们可以对原始代码(基于您更新的帖子)和新版本(基于上面的代码,更新以反映您的新模型公式) 。在这两种情况下,我每个数据集只使用10行(100组合)。

  f.original < .anova(diff.exp.protein.expr.sub [1:10,],lncRNA.expr.sub [1:10,],colData)
f.new< - function()do.call
库(microbenchmark)
microbenchmark(f.original(),f.new(),times = 10)
#单位:秒
#expr min lq median uq max neval
#f.original()2.622461 2.712527 2.824419 2.869193 2.914099 10
#f.new()2.049756 2.074909 2.144422 2.191546 2.224831 10

所以我们可以看到,新版本返回列表而不是数据框,速度提高了25%。



使用 Rprof 分析这两种方法,可以看出原始版本花费了大约50%的时间在 lm(...),而新版本在 lm(...)中花费大约60%这是有道理的,并建议,你可能做的最好的是比新版本大约30%的改进。换句话说: lm(...)的调用是瓶颈:2亿次调用 lm(...)只需要很长时间。



您可以考虑使用并行计算方法,例如使用 foreach ,但在下去之前,你应该考虑其他获得最终结果的策略。


I have two matrices diff.exp.protein.expr and lncRNA.expr each of which have 64 columns but different number of rows.

>dim(diff.exp.protein.expr)
[1] 14000 64

>dim(lncRNA.expr)
[1] 7600 64

I am using these matrices as input in two separate linear models where I compare each row of diff.exp.protein.expr with all rows of lncRNA.expr (=2*106 million tests). Eventually, I want to compare whether the two linear models are statistically different using anova for which I have written a function which is as follows:

lm.anova <- function(diff.exp.protein.expr,lncRNA.expr,colData) 
{
    tempdff <- data.frame() #create an empty dataframe
    for(i in 1:nrow(diff.exp.protein.expr)) #traverse through 1st matrix
    {
        for(j in 1:nrow(lncRNA.expr)) #traverse through 2nd matrix
        {
            #model 1
            lm1 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment)
            #model 2 (has an additional interaction term at the end)
            lm2 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment+lncRNA.expr[j,]+colData$treatment*lncRNA.expr[j,])
            #get the summary of model 2
            lm.model <- summary(lm2)
            #get the rownames of ith row of matrix 1 and jth row of matrix 2
            #get the pvalue from the anova model
            #get the estimate and std. error for last two terms of model 2
            #add these values as a row to tempdff
            tempdff <- rbind(tempdff,data.frame(rownames(diff.exp.protein.expr)[i],rownames(lncRNA.expr)[j],
                                      anova(lm1,lm2)$"Pr(>F)"[2]),lm.model$coefficients[4,1:2],lm.model$coefficients[6,1:2])
        }
    }
    return(tempdff) #return results
}

lm.anova.res <- lm.anova(diff.exp.protein.expr,lncRNA.expr,colData) #call function

And this is how the data, that goes into the function, looks like:

>head(diff.exp.protein.expr)[,1:5] #log transformed values

                     C00060   C00079   C00135   C00150   C00154
ENSG00000000005.5  5.187977 6.323024 6.022986 5.376513 4.810042
ENSG00000000460.12 7.071433 7.302448 7.499133 7.441582 7.439453
ENSG00000000938.8  8.713285 8.584996 8.982816 9.787420 8.823927
ENSG00000001497.12 9.569952 9.658418 9.392670 9.394250 9.087214
ENSG00000001617.7  9.215362 9.417367 8.949943 8.172713 9.198314
ENSG00000001626.10 6.363638 6.038328 6.698733 5.202132 5.336239

>head(lncRNA.expr)[,1:5] #log transformed values
                     C00060   C00079   C00135   C00150   C00154
ENSG00000100181.17 7.477983 7.776463 7.950514 8.098205 7.278615
ENSG00000115934.11 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000122043.6  4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000124915.6  4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000125514.5  4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000129816.5  4.104380 4.104380 4.104380 4.104380 4.104380

>head(colData)[,1:5] #sample information for each column of the two input matrices
sample_name status age gender      site treatment
     C00060     NF  48 Female Cleveland      case
     C00079     NF  26 Female Cleveland      case
     C00135     NF  55 Female Cleveland      case
     C00150     NF  61   Male Cleveland      ctrl
     C00154     NF  50   Male Cleveland      case
     C00176     NF  60 Female Cleveland      ctrl

I wrote this code when I had very few tests (~50) to be performed. Now, I have it all figured out except that I don't know how can I make my code efficient because using a for-loop in this case where 14000*7600 tests have to be performed doesn't make any sense. It would take forever to run. I would like to know what are other alternatives by which I can really speed up this code. Any suggestion would be appreciated.

UPDATE 1: I have updated my linear models a bit. Now, anova(lm1,lm2)"Pr(>F)" does not give the same value as that for the interaction term in model2.

UPDATE 2: I have updated my linear models so that lm1 is nested in lm2.

Thanks,

解决方案

So there are a couple of things.

First, calling data.frame(...) is very expensive, so calling it at each iteration will slow things down considerably. Lists, on the other hand, are extremely efficient. So try to keep everything in lists until the end.

Second, it might just be that running 2 * 106 million regressions is taking most of the time.

I would be inclined to try it this way:

## not tested...
df1 <- t(diff.exp.protein.expr)
df2 <- t(lncRNA.expr)

df  <- cbind(df1,df2,colData)
names <- expand.grid(x=colnames(df1),y=colnames(df2),stringsAsFactors=FALSE)
get.anova <- function(n){
  form.1 <- as.formula(paste0(n[1],"~gender+age+treatment+",n[2]))
  form.2 <- as.formula(paste0(n[1],"~gender+age+treatment+",n[2],"+treatment:",n[2]))
  lm1    <- lm(form.1,data=df)
  lm2    <- lm(form.2,data=df)
  coef <- summary(lm2)$coefficients
  list(n[1],n[2],anova(lm1,lm2)$"Pr(>F)"[2],coef[5,1],coef[5,2],coef[6,1],coef[6,2])
}
result <- do.call(rbind,apply(names,1,get.anova))
result <- data.frame(result)
colnames(result) <- c("protein","lncRNA","p.value","est.1","se.1","est.2","se.2")

This was not tested because the dataset you provided was not big enough: the models have < 0 df for error. Consequently, there is no row 6 in the coefficients table. Your real dataset will not have that problem.

EDIT (Response to OP's comment and provision of data).

With the dataset provided in your comment, we can benchmark both the original code (based on your updated post), and the new version (based on the code above, updated to reflect your new model formulas). In both case I use just 10 rows from each dataset (100 combinations).

f.original <- function() lm.anova(diff.exp.protein.expr.sub[1:10,],lncRNA.expr.sub[1:10,],colData)
f.new      <- function() do.call(rbind,apply(names,1,get.anova))
library(microbenchmark)
microbenchmark(f.original(), f.new(), times=10)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  f.original() 2.622461 2.712527 2.824419 2.869193 2.914099    10
#       f.new() 2.049756 2.074909 2.144422 2.191546 2.224831    10

So we can see that the new version, which returns lists instead of data frames, is about 25% faster.

Profiling both approaches, using Rprof, shows that the original version spends about 50% of it's time in lm(...), whereas the new version spends about 60% of it's (shorter) time in lm(...). This makes sense and suggests that the best you could possibly do is an improvement of about 30% over the new version. In other words: the calls to lm(...) are the bottlenecks: 200 million calls to lm(...) just takes a long time.

You could consider a parallel computing approach, for example using the foreach package, but before going down that road IMO you should consider other strategies for getting to your end result.

这篇关于data.table替代嵌套for-loop?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆