How to perform basic Multiple Sequence Alignments in R?

Question

(I tried asking this on BioStars, but on the slight chance that someone from text mining knows a better solution, I am also reposting it here.)

The task I'm trying to achieve is to align several sequences.

I don't have a reference pattern to match against. All I know is that the "true" pattern should be of length 30, and that the sequences I have had missing values introduced into them at random points.

Here is an example of such sequences. On the left we see the real locations of the missing values, and on the right we see the sequences that we are actually able to observe.

My goal is to reconstruct the left column using only the sequences in the right column (based on the fact that many of the letters in each position are the same).

                     Real_sequence           The_sequence_we_see
1   CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2   CGCAATACTAGC-AGGTGACTTCC-CT-CG   CGCAATACTAGCAGGTGACTTCCCTCG
3   CGCAATGATCAC--GGTGGCTCCCGGTGCG  CGCAATGATCACGGTGGCTCCCGGTGCG
4   CGCAATACTAACCA-CTAACT--CGCTGCG   CGCAATACTAACCACTAACTCGCTGCG
5   CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6   CGCTATACTAACAA-GTG-CTTAGGC-CTG   CGCTATACTAACAAGTGCTTAGGCCTG
7   CCCA-C-CTAA-ACGGTGACTTACGCTCCG   CCCACCTAAACGGTGACTTACGCTCCG
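As a minimal sketch of why the gapped left column is what we want, assume for a moment that it were already known. A per-position majority vote then recovers a consensus pattern of length 30. The names `aligned`, `majority_letter` and `consensus` are illustrative only, and ties are broken alphabetically, which is an arbitrary choice.

```r
# The seven gapped ("real") sequences from the table above.
aligned <- c(
  "CGCAATACTAAC-AGCTGACTTACGCACCG",
  "CGCAATACTAGC-AGGTGACTTCC-CT-CG",
  "CGCAATGATCAC--GGTGGCTCCCGGTGCG",
  "CGCAATACTAACCA-CTAACT--CGCTGCG",
  "CGCACGGGTAAGAACGTGA-TTACGCTCAG",
  "CGCTATACTAACAA-GTG-CTTAGGC-CTG",
  "CCCA-C-CTAA-ACGGTGACTTACGCTCCG"
)

# Character matrix: one row per sequence, one column per position.
aln.matrix <- do.call(rbind, strsplit(aligned, ""))

# Most frequent letter in a column, ignoring gaps.
# Returns "-" if the column contains only gaps.
majority_letter <- function(column) {
  letters.only <- column[column != "-"]
  if (length(letters.only) == 0) return("-")
  names(which.max(table(letters.only)))
}

consensus <- paste(apply(aln.matrix, 2, majority_letter), collapse = "")
consensus
```

This only works once the gaps are in the right places, which is exactly what the right column does not provide: its sequences have different lengths, so their positions no longer line up.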

Here is example code to reproduce the above example:

library(stringr)

ATCG <- c("A", "T", "C", "G")
set.seed(40)
original.seq <- sample(ATCG, 30, replace = TRUE)
# 200 copies of the original sequence, one per row
seqS <- matrix(original.seq, 200, 30, byrow = TRUE)

# Overwrite between 1 and `number.of.changes` randomly chosen positions of x
# with letters drawn from `letters.to.change.with`
change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG)
{
    number.of.changes <- sample(seq_len(number.of.changes), 1)
    new.letters <- sample(letters.to.change.with, number.of.changes, replace = TRUE)
    where.to.change.the.letters <- sample(seq_along(x), number.of.changes, replace = FALSE)
    x[where.to.change.the.letters] <- new.letters
    return(x)
}
change.letters(original.seq)
# Mark 1 to 3 random positions as missing ("-")
insert.missing.values <- function(x) change.letters(x, 3, "-")
insert.missing.values(original.seq)

seqS2 <- t(apply(seqS, 1, change.letters))          # mutate each row
seqS3 <- t(apply(seqS2, 1, insert.missing.values))  # add missing values to each row

seqS4 <- apply(seqS3, 1, function(x) {paste(x, collapse = "")})
# Remove every "-" (str_replace would only remove the first one)
all.seqS <- str_replace_all(seqS4, fixed("-"), "")

# how do we align this?
data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)

I understand that if all I had was a string and a pattern, I would be able to use

library(Biostrings)
pairwiseAlignment(...)
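To illustrate the string-versus-pattern case without relying on Bioconductor, here is a rough sketch that uses base R's `adist()` as a stand-in for `pairwiseAlignment()`. It is a different technique (plain edit distance, with no substitution matrix or gap penalties), and the toy strings and the helper `gap_from_trafo` are made up for illustration.

```r
# Toy data: a known pattern and an observed string that lost one letter.
pattern  <- "ACGT"
observed <- "ACT"

# counts = TRUE adds a "trafos" attribute describing how to turn
# `observed` into `pattern`: M = match, S = substitution,
# I = insertion, D = deletion.
d     <- adist(observed, pattern, counts = TRUE)
trafo <- attr(d, "trafos")[1, 1]

# Rebuild the observed string with "-" wherever a letter has to be
# inserted to reach the pattern. Deleted letters (D) are kept as-is.
gap_from_trafo <- function(observed, trafo) {
  ops   <- strsplit(trafo, "")[[1]]
  chars <- strsplit(observed, "")[[1]]
  out   <- character(length(ops))
  i <- 0
  for (k in seq_along(ops)) {
    if (ops[k] == "I") {
      out[k] <- "-"
    } else {
      i <- i + 1
      out[k] <- chars[i]
    }
  }
  paste(out, collapse = "")
}

gapped <- gap_from_trafo(observed, trafo)
gapped
```

With a known pattern this is enough to place the gaps, but it needs a reference string, which is what is missing here.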

But in the case I present, we are dealing with many sequences that need to be aligned to one another (instead of aligning each of them to a single pattern).

Is there a known method for doing this in R?

Recommended answer

Though this is quite an old thread, I do not want to miss the opportunity to mention that, since Bioconductor 3.1, there is a package 'msa' that implements interfaces to three different multiple sequence alignment algorithms: ClustalW, ClustalOmega, and MUSCLE. The package runs on all major platforms (Linux/Unix, Mac OS, and Windows) and is self-contained in the sense that you need not install any external software. More information can be found at http://www.bioinf.jku.at/software/msa/ and http://www.bioconductor.org/packages/release/bioc/html/msa.html.
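A minimal sketch of how this might look for the seven observed sequences from the question, assuming 'msa' and its Biostrings dependency are installed from Bioconductor. The sequence names `s1` to `s7` are made up for illustration, and the choice of ClustalW with default parameters is arbitrary; "ClustalOmega" or "Muscle" can be passed as `method` instead.

```r
# Observed (ungapped) sequences from the right column of the question's table.
seqs <- c(
  s1 = "CGCAATACTAACAGCTGACTTACGCACCG",
  s2 = "CGCAATACTAGCAGGTGACTTCCCTCG",
  s3 = "CGCAATGATCACGGTGGCTCCCGGTGCG",
  s4 = "CGCAATACTAACCACTAACTCGCTGCG",
  s5 = "CGCACGGGTAAGAACGTGATTACGCTCAG",
  s6 = "CGCTATACTAACAAGTGCTTAGGCCTG",
  s7 = "CCCACCTAAACGGTGACTTACGCTCCG"
)

# Only run the alignment if the Bioconductor package is available.
if (requireNamespace("msa", quietly = TRUE)) {
  library(msa)  # also attaches Biostrings

  dna <- DNAStringSet(seqs)
  # order = "input" keeps the rows in the order given above.
  aln <- msa(dna, method = "ClustalW", order = "input")
  print(aln, show = "complete")

  # Gapped sequences as plain character strings, all of equal length.
  aligned <- as.character(unmasked(aln))
  print(aligned)
}
```

The gaps the aligner places need not coincide with the true missing positions, and the alignment width is not guaranteed to be exactly 30, so the result should be checked against what is known about the data.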
