How to use matchPattern() to find start and stop codons in a file with many sequences (.fasta) in R


Problem description

I have a file (mydata.txt) that contains many exon sequences in FASTA format. I want to find the start ('atg') and stop ('taa', 'tga', 'tag') codons in each DNA sequence, taking the reading frame into account. I tried using matchPattern (a function from the Biostrings R package) to find these codons.

As an example, mydata.txt could be:

>a
atgaatgctaaccccaccgagtaa
>b
atgctaaccactgtcatcaatgcctaa
>c
atggcatgatgccgagaggccagaataggctaa
>d
atggtgatagctaacgtatgctag
>e
atgccatgcgaggagccggctgccattgactag

library(seqinr)      # provides read.fasta
library(Biostrings)  # provides matchPattern

file <- read.fasta(file = "mydata.txt")
matchPattern("atg", file)

Note: read.fasta is a function in the seqinr package that is used to import FASTA files.

But these commands didn't work! How can I use this function to find the start and stop codons in each exon sequence (without frame shifting)?

Recommended answer

The 'subject' argument of matchPattern must be a special object (e.g. an XString). You can convert each of your sequences to an XString by collapsing its characters with paste and passing the result to BString (see ?BString).

So, with your data:

library(seqinr)      # provides read.fasta
library(Biostrings)  # provides BString and matchPattern

file <- read.fasta(file = "mydata.txt")

# find 'atg' locations
atg <- lapply(file, function(x) {
  string <- BString(paste(x, collapse = ""))
  matchPattern("atg", string)
})

atg[1:2]
# $a
#   Views on a 18-letter BString subject
# subject: atgacccccaccgagtaa
# views:
#     start end width
# [1]     1   3     3 [atg]
#
# $b
#  Views on a 21-letter BString subject
# subject: atgcccactgtcatcacctaa
# views:
#     start end width
# [1]     1   3     3 [atg]

As a simple example, here is how to find the number and locations of 'atg's in a single sequence:

sequence <- BString("atgatgccatgcccccatgcatgatatg")
result <- matchPattern("atg", sequence)
#   Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]     4   6     3 [atg]
# [3]     9  11     3 [atg]
# [4]    17  19     3 [atg]
# [5]    21  23     3 [atg]
# [6]    26  28     3 [atg]

# Find out how many 'atg's were found
length(result)
# [1] 6

# Get the start site of each 'atg' (start() is the accessor for the start slot)
start(result)
# [1]  1  4  9 17 21 26

Also, check out ?DNAString and ?RNAString. They are similar to BString, except that they are limited to nucleotide characters and allow quick comparisons between DNA and RNA sequences.
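
A minimal sketch of that difference, assuming the first example sequence from the question as input (the variable names are illustrative):

```r
library(Biostrings)

# DNAString accepts lower-case input and stores it as upper-case nucleotides
dna <- DNAString("atgaatgctaaccccaccgagtaa")

hits <- matchPattern("ATG", dna)
start(hits)
# [1] 1 5

# DNA-specific operations are available, e.g. the reverse complement
reverseComplement(dna)

# A character outside the nucleotide alphabet ('x' here) is rejected,
# whereas BString would accept it
bad <- tryCatch(DNAString("atgxcc"), error = function(e) NULL)
is.null(bad)
```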

Edit to address the frame-shifting concern mentioned in the comments: you can subset the result to keep only the 'atg's that are in frame, using the modulo trick mentioned by @DWin.

# assuming the first 'atg' sets the frame
in.frame.result <- result[(start(result) - start(result)[1]) %% 3 == 0]
#   Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
#     start end width
# [1]     1   3     3 [atg]
# [2]     4   6     3 [atg]

# There are two 'atg's in frame in this result
length(in.frame.result)
# [1] 2

# With your data:
file <- read.fasta(file = "mydata.txt")
atg <- lapply(file, function(x) {
  # collapse the vector of single characters into one string
  string <- BString(paste(x, collapse = ""))
  result <- matchPattern("atg", string)
  # keep only the matches in the frame set by the first 'atg'
  result[(start(result) - start(result)[1]) %% 3 == 0]
})
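
The code above only covers start codons, while the question also asks about stop codons. The same modulo trick can be extended to them. The following is a sketch under stated assumptions, not part of the original answer: in_frame_stops is a hypothetical helper name, and it assumes that the first 'atg' sets the reading frame and that only stop codons downstream of it are of interest.

```r
library(Biostrings)

# Hypothetical helper (not from the original answer): start positions of
# stop codons that share the reading frame of the first 'atg'.
in_frame_stops <- function(seq_string, stops = c("taa", "tga", "tag")) {
  subject <- BString(seq_string)
  atg_starts <- start(matchPattern("atg", subject))
  if (length(atg_starts) == 0) return(integer(0))  # no start codon, no frame

  # collect the start positions of every stop codon, in any frame
  stop_starts <- sort(unlist(lapply(stops, function(codon) {
    start(matchPattern(codon, subject))
  })))

  # keep those downstream of, and in frame with, the first 'atg'
  first_atg <- atg_starts[1]
  stop_starts[stop_starts > first_atg & (stop_starts - first_atg) %% 3 == 0]
}

in_frame_stops("atgaatgctaaccccaccgagtaa")
# [1] 22
```

With sequences imported by read.fasta, the helper could be applied as lapply(file, function(x) in_frame_stops(paste(x, collapse = ""))).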
