How do I get gene features in FASTA nucleotide format from NCBI using Perl?

Problem description

I am able to download a FASTA file manually that looks like:

>lcl|CR543861.1_gene_1...
ATGCTTTGGACA...
>lcl|CR543861.1_gene_2...
GTGCGACTAAAA...

by clicking "Send to" and selecting "Gene Features" on this page. FASTA Nucleotide is the only format offered there, which is fine because that is all I want.

Using a script like this:

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::EUtilities;

my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -db      => 'nucleotide',
                                       -id      => 'CR543861',
                                       -rettype => 'fasta');
my $file = 'CR543861.fasta';
$factory->get_Response(-file => $file);

I get a file that looks like:

>gi|49529273|emb|CR543861.1| Acinetobacter sp. ADP1 complete genome
GATATTTTATCCACA...

with the whole genomic sequence lumped together. How do I get information like in the first (manually downloaded) file?

I looked at a couple of other posts:

  • how to download complete genome sequence in biopython entrez.esearch (this answer seemed relevant)
  • How can I download the entire GenBank file with just an accession number?

As well as this section from EUtilities Cookbook.

I tried fetching and saving a GenBank file (since the .gb file I get seems to list each gene separately), but when I work with it using Bio::SeqIO, I get only one large sequence.

Recommended answer

With that accession number and return type, you are getting the complete genome sequence. If you want to get the individual gene sequences, specify that you want the complete genbank file, then parse out the genes. Here is an example:

#!/usr/bin/env perl

use 5.010;
use strict;
use warnings;
use Bio::SeqIO;
use Bio::DB::EUtilities;


my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                       -email   => 'foo@bar.com',
                                       -db      => 'nucleotide',
                                       -id      => 'CR543861',
                                       -rettype => 'gb');
my $file = 'CR543861.gb';
$factory->get_Response(-file => $file);

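# Read the saved GenBank record and keep only its 'gene' features.
my $seq = Bio::SeqIO->new(-file => $file, -format => 'genbank')->next_seq;
my @gene_features = grep { $_->primary_tag eq 'gene' } $seq->get_SeqFeatures;

for my $feat_object (@gene_features) {
    # Label each gene once (locus_tag if present, else the gene name),
    # so the sequence is not printed again for every tag on the feature.
    my ($id) = map  { $feat_object->get_tag_values($_) }
               grep { $feat_object->has_tag($_) } qw(locus_tag gene);
    $id //= 'unknown';
    # open a filehandle here for writing each to a separate file
    say ">", $id;
    say $feat_object->spliced_seq->seq;
    # close it!
}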

This prints every gene to STDOUT, so redirecting the output collects them all in one file. The comments mark where a small change would write each gene to its own file instead. Parsing GenBank can be a bit tricky at times, so it is always helpful to read the docs, in particular the excellent Feature Annotation HOWTO.
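
For the separate-file variant, here is a minimal sketch using only core Perl. The helper names (fasta_filename, format_fasta, write_gene_fasta) and the 70-column line width are my own assumptions for illustration. They are not part of BioPerl or of the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: turn a FASTA identifier into a safe file name
# by replacing anything other than letters, digits, '.', '_' and '-'.
sub fasta_filename {
    my ($id) = @_;
    (my $safe = $id) =~ s/[^A-Za-z0-9._-]+/_/g;
    return "$safe.fasta";
}

# Hypothetical helper: format one FASTA record, wrapping the sequence
# at $width columns (70 by default, an arbitrary choice).
sub format_fasta {
    my ($id, $seq, $width) = @_;
    $width ||= 70;
    my @lines = $seq =~ /(.{1,$width})/g;
    return join("\n", ">$id", @lines) . "\n";
}

# Hypothetical helper: write one record to its own file inside $dir
# and return the path that was written.
sub write_gene_fasta {
    my ($dir, $id, $seq) = @_;
    my $path = File::Spec->catfile($dir, fasta_filename($id));
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} format_fasta($id, $seq);
    close $fh or die "Cannot close $path: $!";
    return $path;
}
```

Inside the loop over the gene features, the two say lines could then be replaced by a call such as write_gene_fasta('.', $id, $feat_object->spliced_seq->seq), where $id is whatever label you choose for the gene (for example its locus_tag value).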
