How to use matchPattern() to find certain codons in a file with many sequences (.fasta) in R
Problem description
I have a file (mydata.txt) that contains many exon sequences in FASTA format. I want to find the start ('atg') and stop ('taa', 'tga', 'tag') codons for each DNA sequence (considering the frame). I tried using matchPattern (a function from the Biostrings R package) to find these codons:
As an example, mydata.txt could be:
>a
atgaatgctaaccccaccgagtaa
>b
atgctaaccactgtcatcaatgcctaa
>c
atggcatgatgccgagaggccagaataggctaa
>d
atggtgatagctaacgtatgctag
>e
atgccatgcgaggagccggctgccattgactag
file=read.fasta(file="mydata.txt")
matchPattern( "atg" , file)
Note: read.fasta is a function in the seqinr package that is used to import FASTA-format files.
But this command didn't work! How can I use this function to find the start and stop codons in each exon sequence (without frame shifting)?
Recommended answer
The 'subject' argument of matchPattern must be a special object (e.g. an XString). read.fasta returns each sequence as a vector of single letters, so you can convert your sequences to XStrings by collapsing them with paste and passing the result to BString (see ?BString).
So, using your data:
library(seqinr)      # read.fasta
library(Biostrings)  # BString, matchPattern

file <- read.fasta(file = "mydata.txt")
# find 'atg' locations
atg <- lapply(file, function(x) {
  # collapse the vector of single letters into one string
  string <- BString(paste(x, collapse = ""))
  matchPattern("atg", string)
})
atg[1:2]
# $a
#   Views on a 24-letter BString subject
# subject: atgaatgctaaccccaccgagtaa
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]     5   7     3 [atg]
#
# $b
#   Views on a 27-letter BString subject
# subject: atgctaaccactgtcatcaatgcctaa
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]    20  22     3 [atg]
As a simple example, here is how to find the number and locations of 'atg's in a sequence:
sequence <- BString("atgatgccatgcccccatgcatgatatg")
result <- matchPattern("atg", sequence)
result
#   Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]     4   6     3 [atg]
# [3]     9  11     3 [atg]
# [4]    17  19     3 [atg]
# [5]    21  23     3 [atg]
# [6]    26  28     3 [atg]
# Find out how many 'atg's were found
length(result)
# [1] 6
# Get the start site of each 'atg' (start() is the accessor for the ranges slot)
start(result)
# [1]  1  4  9 17 21 26
Also, check out ?DNAString and ?RNAString. They are similar to BString, except that they are limited to nucleotide characters and allow for quick comparisons between DNA and RNA sequences.
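A minimal sketch of the same search on a DNAString, assuming Biostrings is installed. The sequence is record 'a' from the sample file in the question. DNAString stores its letters in upper case, so the pattern is written in upper case here as well.

```r
library(Biostrings)

# Record 'a' from the sample file. DNAString() accepts only DNA letters
# (IUPAC codes) and stores them in upper case.
dna <- DNAString("atgaatgctaaccccaccgagtaa")

# Same call as with BString, but on a nucleotide-specific object.
hits <- matchPattern("ATG", dna)
start(hits)   # start position of every 'ATG'

# The DNA sequence can be converted to its RNA counterpart (T becomes U).
rna <- RNAString(dna)
```

Using DNAString instead of BString also guards against stray characters: a letter that is not a valid nucleotide code raises an error when the object is built, rather than silently producing no matches.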
Edit to address the frame-shifting concern mentioned in the comments: you can subset the result to get only those 'atg's that are in frame, using the modulo trick mentioned by @DWin.
# assuming the first 'atg' sets the frame
in.frame.result <- result[(start(result) - start(result)[1]) %% 3 == 0]
in.frame.result
#   Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]     4   6     3 [atg]
# There are two 'atg's in frame in this result
length(in.frame.result)
# [1] 2
# With your data:
file <- read.fasta(file = "mydata.txt")
atg <- lapply(file, function(x) {
  string <- BString(paste(x, collapse = ""))
  result <- matchPattern("atg", string)
  # keep only the 'atg's in the frame set by the first one
  result[(start(result) - start(result)[1]) %% 3 == 0]
})
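The question also asks about stop codons, which the answer above does not cover. One possible sketch, not part of the original answer, applies the same modulo filter to each stop codon, again assuming the first 'atg' sets the frame. The helper name in_frame_stops is hypothetical, and the test sequence is record 'a' from the sample file.

```r
library(Biostrings)

# Hypothetical helper: start positions of stop codons that lie downstream of
# the first 'atg' and share its reading frame.
in_frame_stops <- function(seq_string, stops = c("taa", "tga", "tag")) {
  subject <- BString(seq_string)
  atg <- start(matchPattern("atg", subject))
  if (length(atg) == 0) return(integer(0))   # no start codon, so no frame
  frame_start <- atg[1]
  # Collect the start positions of all three stop codons.
  pos <- sort(unlist(lapply(stops, function(s) {
    start(matchPattern(s, subject))
  })))
  # Keep positions after the start codon whose offset is a multiple of 3.
  pos[pos > frame_start & (pos - frame_start) %% 3 == 0]
}

in_frame_stops("atgaatgctaaccccaccgagtaa")
# [1] 22
```

In record 'a' the sequence also contains 'tga' at position 2 and 'taa' at position 9, but both are out of frame, so only the terminal 'taa' at position 22 is returned. To run this on the whole file, the helper could be called inside lapply on paste(x, collapse = "") in the same way as above.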