Creating a hash of arrays for DNA sequences in Perl

Question

I have a hash called %id2seq that contains DNA sequence strings keyed by $id. I want to be able to manipulate the DNA sequences by using a position within the string as a reference. For example, if my DNA sequence were ACGTG, my $id would be Sequence 1, my $id2seq{'Sequence 1'} would be ACGTG, and my "theoretical" $id2seq{'Sequence 1'}[3] would be G. I am attempting to create a hash of arrays to do this, but I'm getting weird output (see the output below). I'm pretty sure it's just my formatting. Any input is helpful, and I appreciate it in advance.

Here is a snippet of the input file:

>Sequence 1
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
>Sequence 2
CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
>Sequence 3
TCGACCCTCTGGAACCTATCAGGGACCACAGTCAGCCAGGCAAG

Here is a snippet of my current attempt. (The line that built my original hash of strings is commented out.)

use strict;
use warnings;

print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;

#Remove newline from file
chomp $filename1;

#Open the file and store each dna seq in hash
my %id2seq = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
    if($_ =~ /^>(.+)/)
    {
        $id = $1;
    }
    else
    {
        ## $id2seq{$id} = $_; used to create hash table
        @seqs = split '', $_;
        $id2seq{$id} = [ @seqs ];
    }
}
close FILE;
foreach $id (keys %id2seq)
{
    print "$id2seq{$id}[@seqs]\n\n";
}

Output

Use of unitialized value in concatenation (.) or string at line 37.


T

G

A

T

T

Recommended answer

After the loop, @seqs contains the characters from the last line that was read. An array used as a subscript is evaluated in scalar context, so $id2seq{$id}[@seqs] actually means $id2seq{$id}[N], where N is the number of elements in @seqs (the length of that last line). So you print only one character from each sequence, and you get an "uninitialized value" warning whenever a sequence has no element at index N.
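The scalar-context behaviour described above can be checked with a small sketch. The two strings here are made-up stand-ins (not taken from the question's file): @seqs plays the role of the last line read, and @stored plays the role of one stored sequence.

```perl
use strict;
use warnings;

# Hypothetical data standing in for the last line read from the file.
my @seqs = split //, 'ACGTG';          # 5 elements

# Hypothetical longer array standing in for one stored sequence.
my @stored = split //, 'TCAGAACC';     # 8 elements

# An array used as a subscript is evaluated in scalar context,
# so it yields its element count rather than its contents.
my $via_array = $stored[@seqs];            # same as $stored[5]
my $via_count = $stored[ scalar @seqs ];   # explicit form of the same lookup

print "element count: ", scalar(@seqs), "\n";
print "via array subscript: $via_array\n";
print "via explicit count:  $via_count\n";
```

Both lookups return the same single character, which is why the original loop printed one letter per sequence instead of the whole sequence.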

If you are printing only for debugging, it is easier to use:

use Data::Dumper;
print Dumper(\%id2seq);

Otherwise, you have to iterate over $id2seq{$id} yourself in a nested loop.
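A nested loop of that kind might look like the sketch below. It is only an illustration: the hash is filled with two short made-up sequences instead of being read from the FASTA file, and the variable names $bases and $pos are not from the original code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash of arrays built while reading the file.
my %id2seq = (
    'Sequence 1' => [ split //, 'ACGTG' ],
    'Sequence 2' => [ split //, 'CCCA' ],
);

# Outer loop: one pass per sequence ID (sorted so the order is stable).
foreach my $id (sort keys %id2seq) {
    my $bases = $id2seq{$id};              # reference to that sequence's array
    print "$id\n";

    # Inner loop: one pass per zero-based position in the sequence.
    foreach my $pos (0 .. $#{$bases}) {
        print "  $pos: $bases->[$pos]\n";
    }
}

# A single base can be read directly by its zero-based position.
my $third_base = $id2seq{'Sequence 1'}[2];
print "Third base of Sequence 1: $third_base\n";
```

Positions are zero-based, so index 2 is the third base. When reading from a real file, it would probably also help to chomp each line before splitting it, so the trailing newline does not end up as the last array element.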
