Biopython Tutorial

Biopython - Sequence I/O Operations

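Biopython provides a module, Bio.SeqIO, to read sequences from and write sequences to a file (or any stream). It supports nearly all of the file formats in common use in bioinformatics. Most software provides a different approach for each file format, but Biopython deliberately follows a single approach: parsed sequence data is always presented to the user through its SeqRecord object.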

Let us learn more about SeqRecord in the following section.

SeqRecord

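The Bio.SeqRecord module provides SeqRecord, which holds the meta information of a sequence as well as the sequence data itself, as listed below:

seq : The actual sequence.
id : The primary identifier of the given sequence. The default type is string.
name : The name of the sequence. The default type is string.
description : Human-readable information about the sequence.
annotations : A dictionary of additional information about the sequence.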

SeqRecord can be imported as follows:

from Bio.SeqRecord import SeqRecord
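The attributes listed above can also be set directly when building a record by hand. The following is a minimal sketch; the sequence, identifier, name, and description are made-up values chosen only for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# All values below are hypothetical and exist only to illustrate the attributes.
record = SeqRecord(
    Seq("ATGGCCATTGTAATG"),
    id="example_1",
    name="example",
    description="hypothetical record for illustration",
)

# annotations is a plain dictionary, so extra information can be added freely.
record.annotations["molecule_type"] = "DNA"

print(record.id)
print(record.description)
print(len(record.seq))
```

Parsers in Bio.SeqIO build objects of this same type, so everything shown for a hand-made record also applies to records read from a file.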

In the coming sections, let us look at the finer points of parsing sequence files, using real sequence files.

Parsing Sequence File Formats

This section explains how to parse two of the most popular sequence file formats, FASTA and GenBank.

FASTA

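FASTA is the most basic file format for storing sequence data. Originally, FASTA was a software package for sequence alignment of DNA and protein, developed during the early evolution of bioinformatics and used mostly to search for sequence similarity.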

Biopython provides an example FASTA file, which can be accessed at https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.fasta.

Download this file and save it in your Biopython sample directory as 'orchid.fasta'.

The Bio.SeqIO module provides the parse() method to process sequence files. It can be imported as follows:

from Bio.SeqIO import parse

The parse() method takes two arguments: the first is the file handle and the second is the file format.

 
>>> file = open('path/to/biopython/sample/orchid.fasta')
>>> for record in parse(file, "fasta"):
...     print(record.id)
...
gi|2765658|emb|Z78533.1|CIZ78533
gi|2765657|emb|Z78532.1|CCZ78532
..........
..........
gi|2765565|emb|Z78440.1|PPZ78440
gi|2765564|emb|Z78439.1|PBZ78439
>>> file.close()
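The same loop can be tried without downloading anything. The sketch below writes a tiny two-record FASTA file to a temporary directory and parses it inside a with block, so the handle is closed automatically. The file name toy.fasta and the record contents are assumptions made for illustration and are not part of the orchid data set.

```python
import os
import tempfile

from Bio.SeqIO import parse

# Hypothetical toy data standing in for a real FASTA file.
fasta_text = (
    ">seq1 toy record one\n"
    "ACGTACGT\n"
    ">seq2 toy record two\n"
    "GGCCTTAA\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.fasta")
    with open(path, "w") as out_handle:
        out_handle.write(fasta_text)

    # First argument: the file handle. Second argument: the format name.
    with open(path) as handle:
        ids = [record.id for record in parse(handle, "fasta")]

print(ids)
```

Recent Biopython releases also accept a file name in place of the handle, which avoids managing the handle at all.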

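Here, the parse() method returns an iterator that yields a SeqRecord on every iteration. Because it is iterable, it works with many of Python's simple and sophisticated tools. Let us look at some of them.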

next()

The next() function returns the next item available in the iterable object, which we can use to get the first sequence as given below:

 
>>> from Bio import SeqIO
>>> first_seq_record = next(SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'), 'fasta'))
>>> first_seq_record.id
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> first_seq_record.name
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> first_seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
>>> first_seq_record.description
'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> first_seq_record.annotations
{}
>>>

Here, first_seq_record.annotations is empty because the FASTA format does not support sequence annotations.
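Each call to next() advances the iterator, and calling it on an exhausted iterator raises StopIteration unless a default value is supplied. The sketch below illustrates this with in-memory toy data; the record names and sequences are invented for the example.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA text held in memory instead of a file.
fasta_text = ">seq1 first\nACGT\n>seq2 second\nGGCC\n"

records = SeqIO.parse(StringIO(fasta_text), "fasta")

first = next(records)         # first record
second = next(records)        # second record
third = next(records, None)   # iterator is exhausted, so the default is returned

print(first.id, second.id, third)
```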

list comprehension

We can convert the iterable object into a list using a list comprehension, as given below:

 
>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'), 'fasta')
>>> all_seq = [seq_record for seq_record in seq_iter]
>>> len(all_seq)
94
>>>

Here, we have used the len() function to get the total count. We can get the length of the longest sequence as follows:

 
>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'), 'fasta')
>>> max_seq = max(len(seq_record.seq) for seq_record in seq_iter)
>>> max_seq
789
>>>

We can filter the sequences as well, using the code below:

 
>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'), 'fasta')
>>> seq_under_600 = [seq_record for seq_record in seq_iter if len(seq_record.seq) < 600]
>>> for seq in seq_under_600:
...     print(seq.id)
...
gi|2765606|emb|Z78481.1|PIZ78481
gi|2765605|emb|Z78480.1|PGZ78480
gi|2765601|emb|Z78476.1|PGZ78476
gi|2765595|emb|Z78470.1|PPZ78470
gi|2765594|emb|Z78469.1|PHZ78469
gi|2765564|emb|Z78439.1|PBZ78439
>>>
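The examples above create seq_iter again before each step because an iterator can only be consumed once. One way around this, sketched below with invented in-memory data, is to materialise the records into a list and then count, measure, and filter that list as often as needed. This trades memory for convenience, so it suits small files better than very large ones.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical toy records of length 4, 8, and 12.
fasta_text = (
    ">short\nACGT\n"
    ">medium\nACGTACGT\n"
    ">long\nACGTACGTACGT\n"
)

# Materialise once; the list can then be traversed repeatedly.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

count = len(records)
longest = max(len(record.seq) for record in records)
under_10 = [record.id for record in records if len(record.seq) < 10]

print(count, longest, under_10)
```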

Writing a collection of SeqRecord objects (parsed data) to a file is as simple as calling the SeqIO.write method, as shown below:

 
 file = open("converted.fasta", "w")
 SeqIO.write(seq_record, file, "fasta")
 file.close()

This method can also be used to convert between formats, as specified below:

 
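 # Recent Biopython releases require seq_record.annotations["molecule_type"]
 # to be set (for example to "DNA") before writing GenBank output.
 file = open("converted.gbk", "w")
 SeqIO.write(seq_record, file, "genbank")
 file.close()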
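As a self-contained illustration of the conversion, the sketch below parses FASTA text from memory and writes it back out as GenBank. It assumes a recent Biopython release (1.78 or later), where the GenBank writer needs a molecule_type annotation that FASTA records do not carry; the toy records and the choice of "DNA" are assumptions for the example.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical toy input; a real workflow would read from a file instead.
fasta_text = (
    ">seq1 toy record one\n"
    "ACGTACGT\n"
    ">seq2 toy record two\n"
    "GGCCTTAA\n"
)

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# FASTA carries no molecule type, so it is supplied here by assumption.
for record in records:
    record.annotations["molecule_type"] = "DNA"

genbank_out = StringIO()
written = SeqIO.write(records, genbank_out, "genbank")  # returns the record count
genbank_text = genbank_out.getvalue()

print(written)
print(genbank_text.splitlines()[0])
```

SeqIO.write accepts either a single SeqRecord or any iterable of them and returns the number of records written, which is a convenient sanity check after a conversion.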

GenBank

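GenBank is a richer sequence format for genes and includes fields for various kinds of annotations. Biopython provides an example GenBank file, which can be accessed at https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.gbk.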

Download the file and save it in your Biopython sample directory as 'orchid.gbk'.

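Since Biopython provides a single function, parse, to read all bioinformatics formats, parsing the GenBank format is as simple as changing the format option passed to parse.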

The code for this is given below:

>>> from Bio import SeqIO 
>>> from Bio.SeqIO import parse 
>>> seq_record = next(parse(open('path/to/biopython/sample/orchid.gbk'),'genbank')) 
>>> seq_record.id 
'Z78533.1' 
>>> seq_record.name 
'Z78533' 
>>> seq_record.seq 
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA()) 
>>> seq_record.description 
'C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA' 
>>> seq_record.annotations 
{
   'molecule_type': 'DNA', 
   'topology': 'linear', 
   'data_file_division': 'PLN', 
   'date': '30-NOV-2006', 
   'accessions': ['Z78533'], 
   'sequence_version': 1, 
   'gi': '2765658', 
   'keywords': ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 
   'source': 'Cypripedium irapeanum', 
   'organism': 'Cypripedium irapeanum', 
   'taxonomy': [
      'Eukaryota', 
      'Viridiplantae', 
      'Streptophyta', 
      'Embryophyta', 
      'Tracheophyta', 
      'Spermatophyta', 
      'Magnoliophyta', 
      'Liliopsida', 
      'Asparagales', 
      'Orchidaceae', 
      'Cypripedioideae', 
      'Cypripedium'], 
   'references': [
      Reference(title = 'Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), 
      Reference(title = 'Direct Submission', ...)
   ]
}
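Because annotations is an ordinary dictionary, individual entries can be looked up by key, and using get() with a default avoids a KeyError when a file lacks a field. The sketch below builds a small GenBank record in memory and reads it back, so it runs without the orchid file. The identifier TOY0001.1, the sequence, and the description are hypothetical, and the code assumes Biopython 1.78 or later.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record used only to produce some GenBank text to parse.
toy = SeqRecord(
    Seq("ACGTACGTACGT"),
    id="TOY0001.1",
    name="TOY0001",
    description="hypothetical toy record",
)
toy.annotations["molecule_type"] = "DNA"
toy.annotations["organism"] = "Cypripedium irapeanum"

buffer = StringIO()
SeqIO.write(toy, buffer, "genbank")

# SeqIO.read expects exactly one record in the input.
parsed = SeqIO.read(StringIO(buffer.getvalue()), "genbank")

# Look up annotation entries by key, falling back to a default if absent.
molecule_type = parsed.annotations.get("molecule_type", "unknown")
organism = parsed.annotations.get("organism", "unknown")

print(molecule_type)
print(organism)
print(len(parsed.seq))
```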