Reading .fasta sequences to extract nucleotide data, and then writing to a tab-delimited file


Question

Before I continue, I thought I'd refer readers to my previous problems with Perl, as I'm a beginner at all of this.

These were my posts over the past few days, in chronological order:

  1. How do I average column values from a tab-separated data... (Solved)
  2. Why do I see no computed results in my output file? (Solved)
  3. Using a .fasta file to compute relative content of sequences

Now, as I've stated above, thanks to help from a few of you, I've managed to figure out the first two queries and I've really learnt from them. I'm truly grateful. For a person who knows nothing about this, and still feels like he doesn't, the help was practically a godsend.

The last query remains unsolved, and this post is a continuation of it. I did have a look at some of the recommended texts, but as I'm trying to get this finished before Monday, I'm unsure whether I've overlooked something. Either way, I have had a go at the task.

Just so you know, the task is to open and read a .fasta file (I think I've finally nailed something pretty well, hallelujah!), read each sequence, compute its relative G+C nucleotide content, and then write the names of the genes and their respective G+C content to a tab-delimited file.

Even though I've had a go at this, I know that I am nowhere near ready to execute the program and get the results that I'm after, which is why I'm reaching out to you guys again for some guidance, or examples of how to go about this. As with my previous, solved queries, I'd like it to be in a similar style to what I've already written, even though it might not be the most convenient or efficient way. It just allows me to know what I'm doing each step of the way, even if it looks like I'm spamming the code with comments!

Anyway, the .fasta file reads something like:

>label
sequence
>label
sequence
>label
sequence

I'm unsure how to open the .fasta file, so I'm not sure which label applies to which sequence, but I know that the genes should be labelled either gag, pol, or env. Do I need to open the .fasta file to know what I'm doing, or can I do it 'blindly' by going with the above format?

It may be perfectly obvious, but I'm still struggling with all of this. I feel like I should have caught on by now!

Anyway, the current code I have is as follows:

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    foreach my $line ($infile) {
        if($line = ~/^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
            next;
        } elsif($line = ~/^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
            next; 
        } elsif($line = ~/^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
            next;
        } else {
            $sequence = $line;
        }
    }
    {
        $sequence =~ s/\s//g;               # Whitespace characters are removed
        return $sequence;
    }

I'm not sure if anything's correct here, but executing it left me with a syntax error at line 35 (beyond the last line, so there isn't anything there!). It said the error was at 'EOF'. That's about all I can point out. Otherwise, I'm trying to figure out how to compute the quantities of the nucleotides G + C in each of the sequences, and then how to tabulate this properly in an output .txt file. I believe that's what is meant by a tab-delimited file?

In any case, I apologise if this query seems too lengthy, 'dumb' or a repeat, but I couldn't find any information directly pertaining to this, so your help would be much appreciated, along with explanations for each step if possible!

Kind regards.

Recommended answer

You have an extra brace right near the end. Note also that the match operator has to be written =~ with no space between the = and the ~. This should work:

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.

use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time

while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    if($line =~ /^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
        next;

    } elsif($line =~ /^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
        next; 
    } elsif($line =~ /^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
        next;
    } else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;               # Whitespace characters are removed
    print OUTFILE $sequence;
}

I also edited your return line. return is meant for leaving a subroutine, so it does not belong inside this loop. I suspect what you want is to print the sequence to a file, so that is what I have done. You may need to do some further transformation first to get it into a tab-separated format.
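
As a rough illustration of that further transformation, here is a minimal sketch of one way to pair each label with its sequence, count G+C, and print one tab-separated line per gene. It is not the original poster's or the answerer's code: the helper names gc_fraction, read_fasta and write_gc_table, the header row, the three-decimal formatting and the in-memory sample data are all assumptions made for the example. It also assumes that a sequence may span several lines and that G+C content means (G + C) divided by sequence length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fraction of G and C among the letters of one sequence.
sub gc_fraction {
    my ($sequence) = @_;
    $sequence =~ s/\s//g;                  # drop any whitespace
    my $length = length $sequence;
    return 0 if $length == 0;              # avoid dividing by zero
    my $gc = ($sequence =~ tr/GCgc//);     # tr with an empty replacement only counts
    return $gc / $length;
}

# Hypothetical helper: read FASTA records from an open filehandle and
# return a list of [label, sequence] pairs, joining multi-line sequences.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;          # skip blank lines
        next if $line =~ /^\s*#/;          # skip '#' comment lines, as in the question
        if ($line =~ /^>\s*(.*?)\s*$/) {   # a '>' line starts a new record
            push @records, [$1, ""];
        } elsif (@records) {               # sequence line: append to current record
            $records[-1][1] .= $line;
        }
    }
    return @records;
}

# Hypothetical helper: write a header row, then one "label<TAB>GC" line per record.
sub write_gc_table {
    my ($out, @records) = @_;
    print {$out} "Gene\tGC_content\n";
    for my $record (@records) {
        my ($label, $sequence) = @$record;
        printf {$out} "%s\t%.3f\n", $label, gc_fraction($sequence);
    }
}

# Demo on made-up data held in memory, so the sketch runs without any files.
my $example = ">gag\nATGGGT\nGCGAGA\n>pol\nTTTTAAAA\n>env\nGGCC\n";
open my $in, '<', \$example or die "Cannot read example data: $!";
my @records = read_fasta($in);
close $in;

my $table = "";
open my $out, '>', \$table or die "Cannot open in-memory output: $!";
write_gc_table($out, @records);
close $out;

print $table;
```

To run this against real files, the two open calls could presumably be pointed at the names from the question instead, for example open my $in, '<', "Lab1_seq.fasta" and open my $out, '>', "Lab1_SeqOutput.txt". The three-argument form of open with lexical filehandles is used here because it avoids surprises with unusual file names, but the two-argument style from the question would work as well.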
