Perl: Search a pattern across array elements
Question
I am a Perl newbie, stuck with another bioinformatics problem that requires some help and input.
The problem in brief:
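- I have a file that contains over 40,000 unique DNA sequences. By unique, I mean that each one has a unique sequence ID. I have attached a portion of the file at the end of my post to show what it looks like.
- I need to divide each of the 40,000 sequences into 3 parts. So if a particular sequence is 999 characters long, each of the 3 parts would have 333 characters.
- I need to look for the following pattern in each of the 3 individual parts:
$gpat = '[G]{3,5}';
$npat = '[A-Z]{1,25}';
$pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
- If $pattern appears in the first of the 3 parts, increase the 'beginning' counter; if $pattern occurs in the 2nd of the 3 parts, increase the 'middle' counter; and if $pattern appears in the 3rd part, increase the 'end' counter.
- Print the 'beginning', 'middle' and 'end' counters, i.e. the sum of 'beginning', 'middle' and 'end' over all of the sequences.
Say that in the 1st sequence the values are '2', '5', '3' respectively, and in the 2nd sequence the values are '4', '1', '6'; the final count should then be '6,6,9'.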
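The issues I am having:
- If a particular sequence is split into 3 parts, a potential $pattern match is lost. For example, take a sequence like:
gggatgtcgatgcatggggatgcatcgatgcggggactagctagcgggatgctacgatggggatgatgataatatcgcggcgcatatatgctagtctatatatta
A split into 3 parts produces the following 3 sub-parts, each 35 characters long:
gggatgtcgatgcatggggatgcatcgatgcgggg
actagctagcgggatgctacgatggggatgatgat
aatatcgcggcgcatatatgctagtctatatatta
Hence, $pattern gets split across the first 2 parts. Is there any way to say "if $pattern begins in the 1st part and ends in the 2nd part, increase the count of 'beginning'"?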
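##UPDATE## The following issue has been resolved thanks to the code suggested by Cupidvogel.
2. How do I divide a sequence into 3 parts if its length is not divisible by 3? I tried using int, but then the last part is 1-2 characters short.
The following is the code I have written so far.
It reads in the file, displays the header name and sequence, the length into which each sequence will be divided, and finally the sequence split into 3 parts. This works fine provided the sequence length is divisible by 3; for sequences whose length is not, the final 3rd part is 1-2 characters short.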
# Take filename from user
print "Please enter file name : ";
$in = <>;
chomp $in;
open(FASTA, $in) or die "Cannot open $in: $!";
while (<FASTA>)
{
$/ = ">";
@array = split '\n', $_;
$header = shift @array; # Header of the fasta sequence
print "\n\nNext sequence: \n";
print $header, "\n";
$seq = join '', @array; # sequence
$seq =~ s/\s//g;
$seq =~ s/\*//g;
$seq =~ s/>//g;
print $seq, "\n\n";
$num = int(length($seq) / 3);
@arr = unpack("A$num A$num A*", $seq);
print " New method gives this :", @arr;
print "\nThe first element is :", $arr[0];
print "\nThe second element is :", $arr[1];
print "\nThe third element is :", $arr[2];
# The following lines of code were originally written to split...
# ...the sequence into 3 parts, albeit unsuccessfully
#my $split = (length $seq)/3;
#print $split,"\n\n";
#my $int = int $split;
#print $int,"\n\n";
#my @array2 = $seq =~ /(.{$int})/g;
#print join (" ", @array2),"\n\n";
#print $array2[0],"\n",$array2[1],"\n",$array2[2];
}
exit;
I have been trying the code I have written so far with the following sample file: sample.fa
>ABC_123 2
atgtcgatcgatcggcgggcatgcgcgcgcggatg
atatatagcgcgcgctatatagcgcgactctacgc
atgctgctgactagctatagtcgctgactgcgcgt
gggaaaaagggcccgggccccgttttggggatcta
ggggatagctgatgctagcatgcatgctgactgca
>DEF_456 4
gggatgtcgatgcatggggatgcatcgatgcgggg
actagctagcgggatgctacgatggggatgatgat
aatatcgcggcgcatatatgctagtctatatatta
>GHI_789 1
atagctgctagtcgatcggcgcgggtatcgatcgg
ggatcgatcgatcggggatcgatcgggggatcgat
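The actual input file looks like the following:
>NR_037701 1
aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
agctgtagttatggctggtggagttcagttagtcagcatctggtggagct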
gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
gctccctcttttaaagattttccttccctctttccaactccctgggtcct
ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
cacagactcaaaccctctctcacacacatacacatatacattgttattcc
acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
agggttgggacttcaacacagctttttgggggatcataattcaacccatg
acagccactgagattattatatctccagagaataaatgtgtggagttaaa
aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
>NM_198399 1
aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
caatagcagaggaaggagggactgagcaggagacggccactccagagaac
ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
actgcatatttttacccttatttttgctccttacagcaagattagtaggt
tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
attgattaagatatcattatttttgtttggtttggttttgcttttttcct
cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
aaaaaaaaaaaaa
>NR_026816 1
caacccactctctgtgctatgacttcattactctttcccagcccagccct
gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
ccagagccaaggctacagctagagagttgactcctctatttgagattgac
aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
>NR_027917 1
atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
ggtgtag
>NR_002777 3
cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
agcctaatagagaacataaaattctaaaagataaagataataataatgat
aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
actcataggacataaaaaaaaaaaaaa
>NR_033769 1
ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc
agctcagccagccacagaactggaatttttcaggagcagggggagcatgg
agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt
aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc
ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg
atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta
cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga
acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga
ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa
tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca
gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag
tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt
gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc
cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg
tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt
ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca
cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat
ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg
tggagctggtgcctccagagagccctttgatccagctcttcttggagaga
gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt
tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc
tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca
aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc
aagttaggtttatagagtttgactagttttttcgattagatttgtattag
ttataaatttgttcatagagtttgactaattttttcgattagatttgtat
ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga
ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa
taacagctattgttttgcatatccactgcaggccaagcactttcagcatc
atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg
gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta
tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc
atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa
aaagaaaagaaaaacagagacggtc
>NM_016326 3
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag
cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc
aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt
tttgtaattttatattactttttagtttgatactaagtattaaacatatt
tctgtattcttccacatattttctgcagttattttaactcagtataggag
ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta
acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg
gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga
atctctttactgcctggctggccggcagctccg
>NM_181641 2
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac
ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag
cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa
aaaagaagttttgtaattttatattactttttagtttgatactaagtatt
aaacatatttctgtattcttccacatattttctgcagttattttaactca
gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc
actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc
ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg
gatctttgaatctctttactgcctggctggccggcagctccg
>NM_001144931 1
gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt
ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg
ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg
ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct
gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc
acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag
gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc
agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc
atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg
ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa
cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact
ttgggaggccgag
>NR_029429 1
ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat
ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa
caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct
atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac
actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag
tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc
tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg
caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta
cttcacctgccagccaacccgccagtattgtggagagctcatccttggag
gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc
cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag
gccactggcttgtgctctgagggttgccaggccattgtggataccgagac
cttcctgc
>NR_026551 1
tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc
taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta
ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc
gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg
gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg
acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg
gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg
cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg
gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt
gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa
agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag
aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc
cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga
gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag
gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc
tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg
cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc
cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct
gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct
gagcagagccctcactcccaggcagagttgtctgaatccttcct
>NM_181640 2
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg
taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa
accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt
atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc
ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg
taattttatattactttttagtttgatactaagtattaaacatatttctg
tattcttccacatattttctgcagttattttaactcagtataggagctag
aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt
ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct
gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct
ctttactgcctggctggccggcagctccg
>NM_016951 3
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
acttgatcgattaatgaagtggttattttggcctttgcttgatattatca
actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg
ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt
gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc
tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa
gaagttttgtaattttatattactttttagtttgatactaagtattaaac
atatttctgtattcttccacatattttctgcagttattttaactcagtat
aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta
ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca
cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc
tttgaatctctttactgcctggctggccggcagctccg
>NR_002773 1
cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca
ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa
ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta
ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct
gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca
ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga
gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc
tttctgacccagcagctggggccagggctggtggatgcagcccaggccca
gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg
ctgcagccctggctcacttggacagggggagccccccacctgcccgggag
gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga
gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg
tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc
caagagtacctggacatagaccagatgatcttcgacagagagctgcccca
ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga
acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg
gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct
gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg
cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc
ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct
gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc
ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc
agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg
cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg
agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga
ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg
cttgggccacttctccacgcccctgacccatggggtggactgcccctacc
tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag
acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct
gcggcgacaccactcagatctctactcccactactttgggggccttgcgg
aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat
gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca
caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt
atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc
gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag
gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag
ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg
ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct
ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg
gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat
cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg
cactccatcctgcgtgactgaac
>NR_037806 1
attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg
ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt
ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac
cctccagtgttggcacaggcccacccctggctccaccagagccagaagca
gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc
catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag
agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag
ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca
gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg
gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt
gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag
gattaaaatcacactaataacccctggatggtcaatctgataataggatc
agatttacgtctaccctaattcttaacattgcagctttctctccatctgc
agattattcccagtctcccagtaacacgtttctacccagatcctttttca
tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag
aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg
ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa
gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct
tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc
tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg
aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa
ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg
ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc
ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag
tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc
aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg
agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa
agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa
gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt
attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa
atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt
gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg
gccctaccttgaaatccatcaggtctgcgttggacbggggbb
Thank you for taking the time to look at my problem!
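Answer
Rather than dividing the sequence into three parts, I have implemented this by finding all occurrences of $pattern in the complete sequence and establishing which third of the sequence each match starts in. The built-in variable $-[0] holds the offset of the start of the most recent successful match.
The code below does what you need. It works by accumulating each sequence (which ends when a new sequence ID is found or the end of the file is reached) and passing it to the process_seq subroutine.
The subroutine takes the length of the sequence and calculates the offset of the end of each third of the string. The idiom sprintf '%.0f', $value is used to round the fractional value to the nearest character position.
For each occurrence of $regex in the sequence, the @counts array is adjusted. The element of @counts to increment is determined by comparing the start position of the match, in $-[0], with each of the offsets in @offsets.
After each sequence has been processed, the values in @counts are accumulated into @totals to give an overall figure for all of the sequences.
The output of the program when run on your example data is shown below. The totals are (9, 1, 6).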
use strict;
use warnings;

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;

open my $fh, '<', 'sequences.txt' or die $!;

my ($id, $seq);
my @totals = (0, 0, 0);

while (<$fh>) {
  chomp;
  if (/^>(\w+)/) {
    # A new header line: process the sequence accumulated so far
    process_seq($seq) if $id;
    $id = $1;
    $seq = '';
    print "$id\n";
  }
  elsif ($id) {
    $seq .= $_;
    # The last sequence has no following header, so process it at end of file
    process_seq($seq) if eof;
  }
}

print "Total: @totals\n";

sub process_seq {
  my $sequence = shift;
  my $length = length $sequence;
  # End offset of each third, rounded to the nearest character position
  my @offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;
  my @counts = (0, 0, 0);
  while ($sequence =~ /$regex/g) {
    my $place = $-[0];
    # Count the match in the first third whose end lies beyond its start
    for my $i (0..2) {
      next if $place >= $offsets[$i];
      $counts[$i]++;
      last;
    }
  }
  print "@counts\n\n";
  $totals[$_] += $counts[$_] for 0..2;
}
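The offset calculation and the lookup of the third can be tried in isolation. The following is only a minimal sketch: the subroutine names third_offsets and third_index are hypothetical helpers introduced for illustration and do not appear in the program above, although they follow the same arithmetic.

```perl
use strict;
use warnings;

# Hypothetical helper: end offset of each third of a string of the
# given length, rounded to the nearest character position.
sub third_offsets {
  my ($length) = @_;
  return map { sprintf('%.0f', $length * $_ / 3) } 1 .. 3;
}

# Hypothetical helper: index (0, 1 or 2) of the third that contains
# the match start $place, or an empty return if it lies past the end.
sub third_index {
  my ($place, @offsets) = @_;
  for my $i (0 .. $#offsets) {
    return $i if $place < $offsets[$i];
  }
  return;
}

my @offsets = third_offsets(100);
print "@offsets\n";                       # 33 67 100
print third_index(32, @offsets), "\n";    # 0
print third_index(33, @offsets), "\n";    # 1
print third_index(99, @offsets), "\n";    # 2
```

For a length that is not divisible by 3, the fractional part of $length * $_ / 3 is always one third or two thirds, never exactly one half, so the rounding done by sprintf '%.0f' is unambiguous here.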
Output
NR_037701
0 0 1
NM_198399
1 0 0
NR_026816
1 0 1
NR_027917
0 0 0
NR_002777
0 0 0
NR_033769
1 0 0
NM_016326
1 0 1
NM_181641
1 0 1
NM_001144931
0 0 0
NR_029429
0 1 0
NR_026551
1 0 0
NM_181640
1 0 1
NM_016951
1 0 1
NR_002773
1 0 0
NR_037806
0 0 0
Total: 9 1 6
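To see why searching the whole sequence avoids the boundary problem raised in the question, here is a small sketch that uses the three 35-character lines of the DEF_456 record from sample.fa. It is an illustration under the assumption that the lines are exactly as shown in the question; the variable names are chosen for this example only.

```perl
use strict;
use warnings;

# Same pattern as in the question and the answer.
my $g     = '[G]{3,5}';
my $n     = '[A-Z]{1,25}';
my $regex = qr/$g$n$g$n$g$n$g/i;

# The three 35-character lines of DEF_456 from sample.fa.
my @parts = qw(
  gggatgtcgatgcatggggatgcatcgatgcgggg
  actagctagcgggatgctacgatggggatgatgat
  aatatcgcggcgcatatatgctagtctatatatta
);
my $whole = join '', @parts;

# Number of matches found when each third is searched on its own.
my $split_hits = 0;
for my $part (@parts) {
  $split_hits++ while $part =~ /$regex/g;
}

# Start offsets of the matches found in the unsplit sequence.
my @starts;
push @starts, $-[0] while $whole =~ /$regex/g;

print "split: $split_hits\n";
print "whole: ", scalar(@starts), " starting at @starts\n";
```

With this data, none of the three parts contains the four runs of G that the pattern needs, so the split search reports no matches, while the unsplit search finds one match starting at offset 0, which falls in the first third.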
gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa cagccacctagccatttcccattaaatataatcccatcagcagcagacaa tatctatcctcccctatcccctctatccatatttggaaactgcaccctct tccctatttagcaccctaacaccacttgaattccataaccctgttgttga tctagctctcctcacctctaaacacttctagcattcctttcagatcagga gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa actcataggacataaaaaaaaaaaaaa >NR_033769 1 ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc agctcagccagccacagaactggaatttttcaggagcagggggagcatgg agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg tggagctggtgcctccagagagccctttgatccagctcttcttggagaga gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc aagttaggtttatagagtttgactagttttttcgattagatttgtattag ttataaatttgttcatagagtttgactaattttttcgattagatttgtat ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga 
ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa taacagctattgttttgcatatccactgcaggccaagcactttcagcatc atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa aaagaaaagaaaaacagagacggtc >NM_016326 3 atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt tttgtaattttatattactttttagtttgatactaagtattaaacatatt tctgtattcttccacatattttctgcagttattttaactcagtataggag ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga atctctttactgcctggctggccggcagctccg >NM_181641 2 atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa aaaagaagttttgtaattttatattactttttagtttgatactaagtatt aaacatatttctgtattcttccacatattttctgcagttattttaactca gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg gatctttgaatctctttactgcctggctggccggcagctccg >NM_001144931 1 gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg 
ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact ttgggaggccgag >NR_029429 1 ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta cttcacctgccagccaacccgccagtattgtggagagctcatccttggag gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag gccactggcttgtgctctgagggttgccaggccattgtggataccgagac cttcctgc >NR_026551 1 tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct 
gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct gagcagagccctcactcccaggcagagttgtctgaatccttcct >NM_181640 2 atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg taattttatattactttttagtttgatactaagtattaaacatatttctg tattcttccacatattttctgcagttattttaactcagtataggagctag aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct ctttactgcctggctggccggcagctccg >NM_016951 3 atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag acttgatcgattaatgaagtggttattttggcctttgcttgatattatca actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa gaagttttgtaattttatattactttttagtttgatactaagtattaaac atatttctgtattcttccacatattttctgcagttattttaactcagtat aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc tttgaatctctttactgcctggctggccggcagctccg >NR_002773 1 cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct 
gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc tttctgacccagcagctggggccagggctggtggatgcagcccaggccca gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg ctgcagccctggctcacttggacagggggagccccccacctgcccgggag gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc caagagtacctggacatagaccagatgatcttcgacagagagctgcccca ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg cttgggccacttctccacgcccctgacccatggggtggactgcccctacc tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct gcggcgacaccactcagatctctactcccactactttgggggccttgcgg aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg cactccatcctgcgtgactgaac >NR_037806 1 attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg 
ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac cctccagtgttggcacaggcccacccctggctccaccagagccagaagca gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag gattaaaatcacactaataacccctggatggtcaatctgataataggatc agatttacgtctaccctaattcttaacattgcagctttctctccatctgc agattattcccagtctcccagtaacacgtttctacccagatcctttttca tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg gccctaccttgaaatccatcaggtctgcgttggacacggcattgtacatg ggattagctctg
Any help and input would be deeply appreciated.
Thank you for taking the time to go through my problem!
Solution

Rather than splitting the sequence into three parts, the way I see this working is to find all occurrences of $pattern in the complete sequence and determine in which third the pattern starts. The built-in variable $-[0] contains the offset of the start of the most recent successful match.

The code below does what I think you want. It works by accumulating each sequence (which ends either when a new sequence ID is found or the end of file is reached) and passing it to the process_seq subroutine.

The subroutine takes the length of the sequence and calculates the offset of the end of each third of the string. The idiom sprintf '%.0f', $value is used to round fractional values to the nearest character position.

The @counts array is adjusted for each occurrence of $regex in the sequence. The element of @counts to be incremented is established by comparing the starting position of the match in $-[0] with the end offset of each of the three segments of the sequence. A match that starts in one third and ends in the next is therefore counted in the third where it starts.

Once each sequence has been processed the values in @counts are accumulated into @totals to give overall figures for all sequences.

The output of the program when using your sample data is shown. The grand total is (9, 1, 6).

use strict;
use warnings;

my $gpat    = '[G]{3,5}';
my $npat    = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex   = qr/$pattern/i;

open my $fh, '<', 'sequences.txt' or die $!;

my ($id, $seq);
my @totals = (0, 0, 0);

while (<$fh>) {
  chomp;
  if (/^>(\w+)/) {
    # A new header: process the sequence collected so far, then start a new one
    process_seq($seq) if $id;
    $id  = $1;
    $seq = '';
    print "$id\n";
  }
  elsif ($id) {
    $seq .= $_;
    # The last sequence has no following header, so process it at end of file
    process_seq($seq) if eof;
  }
}

print "Total: @totals\n";

sub process_seq {

  my $sequence = shift;

  my $length = length $sequence;
  # End offsets of the three thirds, rounded to the nearest character
  my @offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;

  my @counts = (0, 0, 0);

  while ($sequence =~ /$regex/g) {
    my $place = $-[0];    # start offset of this match
    for my $i (0..2) {
      next if $place >= $offsets[$i];
      $counts[$i]++;
      last;
    }
  }

  print "@counts\n\n";

  $totals[$_] += $counts[$_] for 0..2;
}
Output

NR_037701
0 0 1

NM_198399
1 0 0

NR_026816
1 0 1

NR_027917
0 0 0

NR_002777
0 0 0

NR_033769
1 0 0

NM_016326
1 0 1

NM_181641
1 0 1

NM_001144931
0 0 0

NR_029429
0 1 0

NR_026551
1 0 0

NM_181640
1 0 1

NM_016951
1 0 1

NR_002773
1 0 0

NR_037806
0 0 0

Total: 9 1 6
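To make the boundary case from the question concrete, here is a minimal sketch of the same idea on a toy 35-character sequence. The helper name third_of, the toy sequence and the simplified pattern g{3,5} are assumptions made for illustration only; they are not part of the answer's program. It uses the same sprintf '%.0f' rounding to find where each third ends, and the built-in @- and @+ arrays to read the start and end offsets of each match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 0, 1 or 2 for the third of a sequence of
# $length characters in which the 0-based position $pos falls.
sub third_of {
    my ($pos, $length) = @_;
    # End offsets of the three thirds, rounded to the nearest character
    my @ends = map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
    for my $i (0 .. 2) {
        return $i if $pos < $ends[$i];
    }
    return 2;    # not reached for a valid position inside the sequence
}

# Toy sequence (an assumption, not real data): 35 characters, so the
# thirds end at offsets 12, 23 and 35.
my $seq = ('a' x 9) . 'ggggg' . ('a' x 10) . 'ggg' . ('a' x 8);

my @matches;
while ($seq =~ /g{3,5}/g) {
    # $-[0] is the start offset of the match, $+[0] the offset just past its end
    push @matches, [ $-[0], $+[0] ];
}

for my $m (@matches) {
    my ($start, $end) = @$m;
    printf "match %d..%d starts in third %d\n",
        $start, $end, third_of($start, length $seq) + 1;
}
```

In this toy sequence the first run of g's starts at offset 9 and ends at offset 14, so it crosses the boundary at 12 but is still counted in the first third, because only the start offset is compared. That is the behaviour the question asked for when a pattern starts in one part and ends in the next.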