根据单独文件中的条目从FASTA文件中提取序列 [英] Extract sequences from a FASTA file based on entries in a separate file

查看:421
本文介绍了根据单独文件中的条目从FASTA文件中提取序列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个文件.

文件1:带有基因序列的FASTA文件,其格式如下例所示:

File 1: a FASTA file with gene sequences, formated like this example:

>PITG_00002 | Phytophthora infestans T30-4 conserved hypothetical protein (426 nt)
ATGCATCGCTCGGGTTCCGCACGGAAAGCCCAAGGTCTGGGATTACGGGGTGGTGGTCGG
TTACACTTGGAATAACCTCGCAAATTCAGAATCTCTACAGGCTACGTTCGCGGATGGAAC
>PITG_00003 | Phytophthora infestans T30-4 protein kinase (297 nt)
ATGACGGCTGGGGTCGGTACGCCCTACTGGATCGCACCGGAGATTCTTGAAGGCAAACGG
TACACTGAGCAAGCGGATATTTACTCGTTCGGAGTGGTTTTATCCGAGCTGGACACGTGC
AAGATGCCGTTCTCTGACGTCGTTACGGCAGAGGGAAAGAAACCCAAACCAGTTCAGATC
>PITG_00004 | Phytophthora infestans T30-4 protein kinase, putative (1969 nt)
ATGCGCGTGTCTGGTCTCCTTTCAATTCTTGCAGCCACTTTGACCACGGCCCAAGACTAC

文件2:一个简单的文本文件,带有基因的登录号.像这样.

File 2: A simple text file with JUST the accession identification of the gene. Like so.

PITG_00003
PITG_00005
PITG_00023

文件2中的每个条目都在文件1中的某个位置,但是文件1中并非每个条目都在文件2中.我需要从文件1中删除所有不在文件2中的条目.在可以帮助我的biopython模块中,我只是不知道什么.例如,我本来以为我可以使用SeqIO.parse函数从FASTA文件中仅提取出种质,但这确实使我获得了两个种质号的文件.我不知道如何有选择地提取其他文件中的种质.也许就像将文件2中的所有条目读入字典中,然后将该条目与其在文件1中的匹配条目相关联,并使用SeqIO.parse提取整个序列...但我真的不知道....任何帮助任何人都可以给我,我非常感激!

Every entry in File 2 is somewhere in File 1, but not every entry in File 1 is in File 2. I need to remove all the entries from File 1 that are not in File 2. I feel like there must be something in the biopython module that could help me, I just don't know what. For instance, I originally thought that I could extract just the accessions from my FASTA file using the SeqIO.parse function, but this really just lands me with two files of accession numbers. I don't know how to selectively extract the accessions that are in the other file. Maybe like reading all the entries from File 2 into a dictionary and then associated that entry with its matching entry in File 1 and use SeqIO.parse to extract the whole sequence...But I really don't know....Any help anyone could give me is extremely appreciated!

推荐答案

尝试一下:

f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

为简要说明这里发生的情况... accessionids.txt是您的文件2 ,而fasta.txt是您的文件1 .显然,您需要用代码中的实际文件名替换这些文件名.

To briefly explain what's going on here... accessionids.txt is your File 2, whereas fasta.txt is your File 1. Obviously you'll need to replace these filenames with your actual filenames within the code.

首先,我们创建一个字典(有时称为哈希或关联数组),并为文件2 中的每个登录ID创建一个条目,其中 key 为登录ID和 value 设置为1(在这种情况下,该值并不重要).

First, we create a dictionary (sometimes referred to as a hash or associative array) and for every Accession ID in File 2 we create an entry where the key is the Accession ID and the value is set to 1 (not that the value really matters in this case).

接下来,我们查看文件1 ,然后再次查看该文件中的每一行.如果文件中的行以>开头,那么我们知道它包含一个登录ID.我们将这行代码沿|分割,因为每条具有登录ID的行在字符串中都将有一个|.接下来,按照_splitline[0]的指定进行分割的第一部分.我们使用accessorIDWithArrow[1:-1]截断字符串中的第一个和最后一个字符,它们是前面的>符号和后面的空白.

Next we look in File 1 and again look at each line in that file. If the line in the file starts with > then we know that it contains an Accession ID. We take that line and split it along the | since every line with an Accession ID will have a | in the string. Next, take the first part of the split as specified by _splitline[0]. We use accessorIDWithArrow[1:-1] to chop off the first and last characters in the string which are the > symbol in the front and a blank space in the rear.

这时,accessorID现在包含我们希望从文件2 获得的格式的登录号.

At this point, accessorID now contains the Accession ID in the format that we expect from File 2.

接下来,我们检查之前创建并填充的词典是否将此登录ID定义为键.如果是这样,我们立即将具有登录ID的行写入新文件fasta_parsed.txt,并将skip'flag'变量设置/重置为0.然后,包含if not skip段的else语句将允许与我们发现要打印到fasta_parsed.txt文件的Accession ID相关的后续行.

Next, we check if the dictionary we created and populated earlier has this Accession ID defined as a key. If it does, we immediately write the line with the Accession ID to a new file, fasta_parsed.txt, and set/reset the skip 'flag' variable to 0. The else statement containing the if not skip segment will then allow subsequent lines associated with the Accession ID that we found to be printed to the fasta_parsed.txt file.

对于在字典中找不到的文件1 中的登录ID(不在文件2 中),我们不将行写到fasta_parsed.txt,而是将skip标记为0.因此,直到在文件2 中存在的文件1 中找到另一个登录ID,所有后续行都将被跳过.

For Accession ID from File 1 not found in the dictionary (not in File 2), we don't write the line to fasta_parsed.txt and we set the skip flag to 0. Thus, until another Accession ID is found in File 1 that exists in File 2, all subsequent lines will be skipped.

这篇关于根据单独文件中的条目从FASTA文件中提取序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆