的R - 序列比对功能,时间太长运行 [英] R - sequence alignment function taking too long to run

查看:348
本文介绍了的R - 序列比对功能,时间太长运行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以,我是相当新的R和我有一个运行时的问题。我写了下面的嵌套while循环使用Biostrings包(biocLite),以便从两个物种链接蛋白质序列,如果他们有一个90%的比对评分。

基本上,我输入2蛋白的基因组,在SeqData1比较每个氨基酸序列与从SeqData2各氨基酸序列,计算一个对齐得分,并且如果分数> 90%予串联的匹配和蛋白质名称的列表所述SeqData2蛋白质的序列。

该功能不工作完全像它应该,唯一的问题是,与蛋白质,它能够扫描我预测,运行时整个事情是1.4左右的月数。牛逼

有没有人对如何成倍加快这一功能的运行有什么建议?

谢谢!

R code:

  SeqScore<  - 函数()
{

源(http://bioconductor.org/biocLite.R)
biocLite()
要求(Biostrings)
数据(BLOSUM100)


SeqData1<  -  readDNAStringSet(SeqData1.fasta)
proNum1 = 84390#数量SEQ1蛋白质

SeqData2<  -  readDNAStringSet(SeqData2.fasta)
proNum2 = 15194#数序列的蛋白质


#Create空单,以填补%的分数和匹配序列:
DLIST = NULL
QueSeqList = NULL
TotList = NULL

#initiating计数器:
I = 1
J = 1
C = 0

#Perform对齐并生成百分比同一性分数:
而(I< = proNum1)
  {
  而(J< = proNum2)
    {
  SeqAlign<  -  pairwiseAlignment(SeqData1 [我],SeqData2 [J],substitutionMatrix = BLOSUM100,gapOpening = 0,gapExtension = -5)
  PercAlign<  -  PID(SeqAlign)
  如果(PercAlign> = 90)
  {
    DLIST = C(DLIST,名称(SeqData1 [I]),名(SeqData2 [J]))
    QueSeqList = C(QueSeqList,的toString(SeqData2 [J]))
    C = C + 1
  }
  其他{C = C + 1;打印(C)}
  当J = J + 1个
  }
  I = I + 1
  J = 1 q若要重设内部while循环
  }
不公开< -t(sapply(DLIST,不公开));
outputMatrix< -cbind(DLIST,QueSeqList)
outputMatrix&其中; -as.matrix(outputMatrix,NCOL = 3)
write.csv(outputMatrix,outputMatrix.csv)
}
 

解决方案

在帮助页面,我觉得 pairwiseAlignment 工作时,它的第一个参数是 DNAStringSet 任何长度的,所以外环可以通过

  SeqAlign<  -  pairwiseAlignment(SeqData1,SeqData2 [J]。
    substitutionMatrix = BLOSUM100,gapOpening = 0,gapExtension = -5)
 

我不会用一个,而循环,而是瞄准了线的 lapply ,以及

 结果<  -  lapply(SeqData2,功能(elt_j){
        SeqAlign<  -  pairwiseAlignment(SeqData1,elt_j,
            substitutionMatrix = BLOSUM100,gapOpening = 0,gapExtension = -5)
        PID(SeqAlign)
})
data.frame(的toString(SeqData1),代表(的toString(SeqData2),每=长度(SeqData1)),
    不公开(PID,use.names = FALSE))
 

这可以节省你从R魔族的第二圈,同时也提出了一个简单的策略并行这种在非Windows,如果计算仍缓慢:

 库(平行)
选项​​(mc.cores = detectCores())
结果<  -  mclapply(seq_along(... ##和以前一样
 

这是更好地询问它的长度数据,而不是它单独 proNum1 =长度(SeqData1)键入。此外,我希望你不使用 biocLite()每次运行脚本的时间;只是一次包安装到任何给定的R安装。

您将获得在 Bioconductor的邮件列表 - 无需订阅

So, I am fairly new to R and I have a runtime question. I wrote the following nested while-loop using the "Biostrings" package (biocLite) in order to link protein sequences from two species if they have a >90% alignment score.

Basically, I input two protein genomes, compare each amino acid sequence in SeqData1 to each amino acid sequence from SeqData2, compute an alignment score, and if the score is >90% I concatenate a list of the protein names that match and the sequence of the SeqData2 protein.

The function does work exactly like it should, only problem is that with the number of proteins that it has to scan I am projecting that the runtime for the whole thing is around 1.4 months. T

Does anyone have any suggestions on how to exponentially speed up the runtime of this function?

Thanks!

R code:

SeqScore <- function()
{

source("http://bioconductor.org/biocLite.R")
biocLite()
require("Biostrings")
data(BLOSUM100)


SeqData1 <- readDNAStringSet("SeqData1.fasta")
proNum1 = 84390     # number of proteins in Seq1

SeqData2 <- readDNAStringSet("SeqData2.fasta")
proNum2 = 15194     # number of proteins in Seq


#Create empty list to fill with percent scores and matching sequences:
DList=NULL
QueSeqList = NULL
TotList = NULL

#initiating the counters:
i=1
j=1
c=0

#Perform alignment and generate percent identity scores:
while(i<=proNum1)
  {
  while(j<=proNum2)
    {
  SeqAlign <- pairwiseAlignment(SeqData1[i], SeqData2[j], substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)
  PercAlign <- pid(SeqAlign)
  if(PercAlign>=90)
  {
    DList = c(DList, names(SeqData1[i]), names(SeqData2[j]))
    QueSeqList = c(QueSeqList, toString(SeqData2[j]))
    c=c+1
  }
  else{c=c+1; print(c)}
  j=j+1
  }
  i=i+1
  j=1 #to reset the inner while loop
  }
unlist<-t(sapply(DList, unlist));
outputMatrix<-cbind(DList,QueSeqList)
outputMatrix<-as.matrix(outputMatrix, ncol=3)
write.csv(outputMatrix, "outputMatrix.csv")
}

解决方案

From the help page, I think that pairwiseAlignment works when it's first argument is a DNAStringSet of any length, so the outer loop can be replaced by

SeqAlign <- pairwiseAlignment(SeqData1, SeqData2[j],
    substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)

I wouldn't use a while loop, but instead aim for an lapply, along the lines of

results <- lapply(SeqData2, function(elt_j) {
        SeqAlign <- pairwiseAlignment(SeqData1, elt_j,
            substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)
        pid(SeqAlign)
})
data.frame(toString(SeqData1), rep(toString(SeqData2), each=length(SeqData1)),
    unlist(pid, use.names=FALSE))

This saves you from the second circle of the R inferno, and also suggests a simple strategy to parallelize this on non-Windows if the calculations are still slow:

library(parallel)
options(mc.cores=detectCores())
results <- mclapply(seq_along(... ## as before

It's better to ask the data about it's length than to type it in separately proNum1 = length(SeqData1). Also I hope you're not using biocLite() each time you run the script; just once to install the package into any given R installation.

You'll get a more authoritative answer on the Bioconductor mailing list -- no subscription required.

这篇关于的R - 序列比对功能,时间太长运行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆