Perl: Search a pattern across array elements


Problem description

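I am a Perl newbie, stuck on another bioinformatics problem that needs some help and input.

The problem in brief:

  1. I have a file containing over 40,000 unique DNA sequences. By unique, I mean that each has a unique sequence ID. A portion of the file is attached at the end of this post to show what it looks like.

  2. I need to divide each of the 40,000 sequences into 3 parts. So if a particular sequence is 999 characters long, each of the 3 parts would have 333 characters.

  3. I need to look for the following pattern in each of the 3 individual parts:

    $gpat = '[G]{3,5}';
    $npat = '[A-Z]{1,25}';

    $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;

  4. If $pattern appears in the first of the 3 parts, increase the 'beginning' counter; if it occurs in the 2nd part, increase the 'middle' counter; and if it appears in the 3rd part, increase the 'end' counter.

  5. Print the 'beginning', 'middle' and 'end' counters, i.e. the sums of 'beginning', 'middle' and 'end' over all the sequences.

    Say the values for the 1st sequence are '2', '5', '3' respectively and for the 2nd sequence '4', '1', '6'; the final count should then be '6,6,9'.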
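As a minimal sketch of step 3, the two building blocks can be joined into one compiled regex and used to count matches in a single string. The helper name count_matches and the test strings are made up for illustration; the /i flag is an assumption, added because the sequences in the file are lowercase.

```perl
use strict;
use warnings;

# Building blocks from the question: a run of 3-5 G's, then 1-25 letters.
my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';

# Four G-runs separated by three spacers, compiled once.
my $pattern = join '', $gpat, $npat, $gpat, $npat, $gpat, $npat, $gpat;
my $regex   = qr/$pattern/i;

# count_matches is a hypothetical helper: the number of non-overlapping
# matches of $regex in the given string.
sub count_matches {
    my ($text) = @_;
    my $count = () = $text =~ /$regex/g;
    return $count;
}

print count_matches('gggaGGGtgggcGGG'), "\n";   # four G-runs, three spacers
print count_matches('acgtacgtacgt'), "\n";      # no G-run of length 3
```

Compiling the pattern once with qr// keeps the later matching code short and avoids rebuilding the regex for every sequence.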



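The issues I am having:

  1. If a particular sequence is split into 3 parts, a potential $pattern match is lost. For example, take a sequence like:

    gggatgtcgatgcatggggatgcatcgatgcggggactagctagcgggatgctacgatggggatgatgataatatcgcggcgcatatatgctagtctatatatta

    Splitting it into 3 parts produces the following 3 sub-parts, each 35 characters long:

    gggatgtcgatgcatggggatgcatcgatgcgggg
    actagctagcgggatgctacgatggggatgatgat
    aatatcgcggcgcatatatgctagtctatatatta

    Hence, $pattern gets split across the first 2 parts. Is there any way to say "if $pattern begins in the 1st part and ends in the 2nd part, increase the 'beginning' count"?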


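    ##UPDATE## The following issue has been resolved thanks to the code suggested by Cupidvogel.

    2. How do I divide a sequence into 3 parts if its length is not divisible by 3? I tried using int, but then the last part is 1-2 characters short.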
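Issue 1 can be reproduced with a short sketch, assuming the pattern from step 3 with a case-insensitive flag. It searches the three 35-character parts separately and then searches the unsplit sequence; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $regex = qr/[G]{3,5}[A-Z]{1,25}[G]{3,5}[A-Z]{1,25}[G]{3,5}[A-Z]{1,25}[G]{3,5}/i;

# The example sequence from issue 1, as its three 35-character parts.
my @parts = (
    'gggatgtcgatgcatggggatgcatcgatgcgggg',
    'actagctagcgggatgctacgatggggatgatgat',
    'aatatcgcggcgcatatatgctagtctatatatta',
);
my $whole = join '', @parts;

# Searching each part on its own: none holds four separate G-runs.
my @hits_per_part;
for my $part (@parts) {
    my $n = () = $part =~ /$regex/g;
    push @hits_per_part, $n;
}

# Searching the unsplit sequence: record where the first match starts.
my $start_of_first = $whole =~ /$regex/ ? $-[0] : undef;

print "hits per part: @hits_per_part\n";
print "first match in whole sequence starts at offset $start_of_first\n";
```

The match begins in the first part and ends in the second, which is why searching the parts separately misses it.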


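    The following is the code I have written so far.

    It reads in the file and displays the header name and sequence, the length of each part, and finally the sequence split into 3 parts. This works fine provided the sequence length is divisible by 3; for sequences where it is not, the final 3rd part is 1-2 characters short.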

    # Take filename from user
    print "Please enter file name : ";
    $in = <>;
    chomp $in;

    open (FASTA, "$in") or die;
    while (<FASTA>)
    {
        $/ = ">";
        @array = split '\n', $_;
        $header = shift @array;   # Header of the fasta sequence
        print "\n\nNext sequence: \n";
        print $header, "\n";

        $seq = join '', @array;   # sequence
        $seq =~ s/\s//g;
        $seq =~ s/\*//g;
        $seq =~ s/>//g;
        print $seq, "\n\n";

        $num = int(length($seq)/3);
        @arr = unpack("A$num A$num A*", $seq);
        print " New method gives this :", @arr;
        print "\nThe first element is :", $arr[0];
        print "\nThe second element is :", $arr[1];
        print "\nThe third element is :", $arr[2];

        # The following lines of code were originally written to split...
        # ...the sequence into 3 parts, albeit unsuccessfully
        #my $split = (length $seq)/3;
        #print $split,"\n\n";

        #my $int = int $split;
        #print $int,"\n\n";

        #my @array2 = $seq =~ /(.{$int})/g;
        #print join (" ", @array2),"\n\n";

        #print $array2[0],"\n",$array2[1],"\n",$array2[2];
    }

    exit;

    I have been trying the code I have written so far with the following sample file: sample.fa


    >ABC_123 2
    atgtcgatcgatcggcgggcatgcgcgcgcggatg
    atatatagcgcgcgctatatagcgcgactctacgc
    atgctgctgactagctatagtcgctgactgcgcgt
    gggaaaaagggcccgggccccgttttggggatcta
    ggggatagctgatgctagcatgcatgctgactgca
    >DEF_456 4
    gggatgtcgatgcatggggatgcatcgatgcgggg
    actagctagcgggatgctacgatggggatgatgat
    aatatcgcggcgcatatatgctagtctatatatta
    >GHI_789 1
    atagctgctagtcgatcggcgcgggtatcgatcgg
    ggatcgatcgatcggggatcgatcgggggatcgat

    The actual input file looks like the following:

    >NR_037701 1
    aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
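    tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
    aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
    ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
    gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
    agctgtagttatggctggtggagttcagttagtcagcatctggtggagct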
    gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
    agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
    gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
    gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
    cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
    aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
    actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
    ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
    tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
    cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
    ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
    gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
    cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
    caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
    gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
    ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
    tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
    tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
    acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
    aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
    catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
    ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
    tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
    gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
    ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
    atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
    gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
    gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
    gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
    gctccctcttttaaagattttccttccctctttccaactccctgggtcct
    ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
    tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
    ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
    agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
    gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
    cacagactcaaaccctctctcacacacatacacatatacattgttattcc
    acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
    ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
    caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
    tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
    agggttgggacttcaacacagctttttgggggatcataattcaacccatg
    acagccactgagattattatatctccagagaataaatgtgtggagttaaa
    aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
    ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
    gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
    tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
    tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
    aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
    ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
    actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
    gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
    gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
    tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
    gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
    ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
    gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
    aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
    ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
    caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
    actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
    tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaa
    >NM_198399 1
    aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
    agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
    caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
    tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
    tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
    attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
    caatagcagaggaaggagggactgagcaggagacggccactccagagaac
    ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
    gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
    caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
    gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
    aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
    gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
    ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
    cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
    tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
    atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
    gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
    actgcatatttttacccttatttttgctccttacagcaagattagtaggt
    tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
    tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
    taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
    tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
    attgattaagatatcattatttttgtttggtttggttttgcttttttcct
    cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
    tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
    tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
    atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
    gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
    cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
    aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
    tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
    aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
    gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
    ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
    gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
    tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
    agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
    gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
    gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
    ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
    agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
    tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
    aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
    cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
    tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
    atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
    ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
    aaaaaaaaaaaaa
    >NR_026816 1
    caacccactctctgtgctatgacttcattactctttcccagcccagccct
    gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
    atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
    aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
    agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
    ccagagccaaggctacagctagagagttgactcctctatttgagattgac
    aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
    tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
    cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
    gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
    tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
    tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
    >NR_027917 1
    atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
    cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
    ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
    ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
    gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
    ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
    tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
    ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
    ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
    cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
    gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
    ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
    gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
    ggtgtag
    >NR_002777 3
    cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
    ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
    gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
    aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
    tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
    aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
    cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
    taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
    ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
    agcctaatagagaacataaaattctaaaagataaagataataataatgat
    aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
    tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
    aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
    gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
    cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
    tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
    tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
    tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
    gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
    ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
    cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
    cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
    aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
    cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
    ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
    actcataggacataaaaaaaaaaaaaa
    >NR_033769 1
    ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc
    agctcagccagccacagaactggaatttttcaggagcagggggagcatgg
    agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt
    aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc
    ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg
    atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta
    cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga
    acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga
    ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa
    tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca
    gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag
    tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt
    gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc
    cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg
    tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt
    ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca
    cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat
    ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg
    tggagctggtgcctccagagagccctttgatccagctcttcttggagaga
    gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt
    tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc
    tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca
    aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc
    aagttaggtttatagagtttgactagttttttcgattagatttgtattag
    ttataaatttgttcatagagtttgactaattttttcgattagatttgtat
    ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga
    ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa
    taacagctattgttttgcatatccactgcaggccaagcactttcagcatc
    atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg
    gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta
    tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc
    atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa
    aaagaaaagaaaaacagagacggtc
    >NM_016326 3
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag
    cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc
    aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt
    tttgtaattttatattactttttagtttgatactaagtattaaacatatt
    tctgtattcttccacatattttctgcagttattttaactcagtataggag
    ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta
    acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg
    gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga
    atctctttactgcctggctggccggcagctccg
    >NM_181641 2
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
    tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
    ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
    acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac
    ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag
    cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa
    aaaagaagttttgtaattttatattactttttagtttgatactaagtatt
    aaacatatttctgtattcttccacatattttctgcagttattttaactca
    gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc
    actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc
    ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg
    gatctttgaatctctttactgcctggctggccggcagctccg
    >NM_001144931 1
    gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt
    ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg
    ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg
    ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct
    gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc
    acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag
    gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc
    agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc
    atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg
    ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa
    cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact
    ttgggaggccgag
    >NR_029429 1
    ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat
    ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa
    caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct
    atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac
    actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag
    tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc
    tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg
    caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta
    cttcacctgccagccaacccgccagtattgtggagagctcatccttggag
    gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc
    cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag
    gccactggcttgtgctctgagggttgccaggccattgtggataccgagac
    cttcctgc
    >NR_026551 1
    tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc
    taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta
    ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc
    gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg
    gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg
    acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg
    gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg
    cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg
    gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt
    gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa
    agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag
    aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc
    cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga
    gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag
    gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc
    tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg
    cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc
    cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct
    gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct
    gagcagagccctcactcccaggcagagttgtctgaatccttcct
    >NM_181640 2
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg
    taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa
    accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt
    atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc
    ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg
    taattttatattactttttagtttgatactaagtattaaacatatttctg
    tattcttccacatattttctgcagttattttaactcagtataggagctag
    aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt
    ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct
    gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct
    ctttactgcctggctggccggcagctccg
    >NM_016951 3
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
    tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
    ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
    acttgatcgattaatgaagtggttattttggcctttgcttgatattatca
    actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg
    ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt
    gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc
    tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa
    gaagttttgtaattttatattactttttagtttgatactaagtattaaac
    atatttctgtattcttccacatattttctgcagttattttaactcagtat
    aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta
    ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca
    cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc
    tttgaatctctttactgcctggctggccggcagctccg
    >NR_002773 1
    cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca
    ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa
    ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta
    ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct
    gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca
    ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga
    gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc
    tttctgacccagcagctggggccagggctggtggatgcagcccaggccca
    gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg
    ctgcagccctggctcacttggacagggggagccccccacctgcccgggag
    gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga
    gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg
    tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc
    caagagtacctggacatagaccagatgatcttcgacagagagctgcccca
    ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga
    acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg
    gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct
    gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg
    cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc
    ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct
    gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc
    ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc
    agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg
    cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg
    agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga
    ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg
    cttgggccacttctccacgcccctgacccatggggtggactgcccctacc
    tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag
    acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct
    gcggcgacaccactcagatctctactcccactactttgggggccttgcgg
    aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat
    gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca
    caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt
    atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc
    gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag
    gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag
    ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg
    ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct
    ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg
    gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat
    cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg
    cactccatcctgcgtgactgaac
    >NR_037806 1
    attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg
    ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt
    ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac
    cctccagtgttggcacaggcccacccctggctccaccagagccagaagca
    gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc
    catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag
    agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag
    ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca
    gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg
    gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt
    gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag
    gattaaaatcacactaataacccctggatggtcaatctgataataggatc
    agatttacgtctaccctaattcttaacattgcagctttctctccatctgc
    agattattcccagtctcccagtaacacgtttctacccagatcctttttca
    tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag
    aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg
    ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa
    gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct
    tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc
    tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg
    aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa
    ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg
    ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc
    ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag
    tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc
    aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg
    agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa
    agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa
    gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt
    attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa
    atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt
    gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg
    gccctaccttgaaatccatcaggtctgcgttggacbggggbb


    Thank you for taking the time to go through my problem!

    Solution

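    Rather than splitting the sequence into three parts, I would find all occurrences of $pattern in the complete sequence and determine which third of the sequence each match starts in.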



    The built-in variable $-[0] holds the offset of the start of the most recent successful match.
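A small illustration of the match-offset arrays, using a made-up string. @- and @+ are the built-in arrays; the copies in $start and $end are there only to make printing straightforward.

```perl
use strict;
use warnings;

my $seq = 'atatGGGcccGGG';   # made-up example string

# After a successful match, $-[0] is the offset where the match starts
# and $+[0] is the offset just past its end.
if ($seq =~ /GGG/) {
    my ($start, $end) = ($-[0], $+[0]);
    print "first match: start=$start end=$end\n";
}

# In a /g loop the offsets are updated for each successive match.
my @starts;
while ($seq =~ /GGG/g) {
    push @starts, $-[0];
}
print "all starts: @starts\n";
```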



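    The code below should do what you need. It works by accumulating each sequence (which ends when a new sequence ID is found or the end of the file is reached) and passing it to the process_seq subroutine.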
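The accumulation step can be sketched on its own. This variant differs from the program in this answer: it collects every record in a list before any processing, and it reads from an in-memory string so that it runs without a data file. The helper name read_fasta and the two-record excerpt are assumptions for illustration.

```perl
use strict;
use warnings;

# read_fasta is a hypothetical helper: it returns a list of [id, sequence]
# pairs, joining the wrapped lines that follow each ">" header.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\w+)/) {
            push @records, [$1, ''];       # a header starts a new record
        }
        elsif (@records) {
            $line =~ s/\s+//g;             # drop stray whitespace
            $records[-1][1] .= $line;      # append to the current sequence
        }
    }
    return @records;
}

# A shortened excerpt in the same layout as sample.fa, read from a string.
my $fasta = ">DEF_456 4\ngggatgtcgatg\ncatggggatgca\n>GHI_789 1\natagctgc\n";
open my $in, '<', \$fasta or die "Cannot open in-memory file: $!";
for my $rec (read_fasta($in)) {
    printf "%s %d\n", $rec->[0], length $rec->[1];
}
close $in;
```

For a file with 40,000 sequences, processing each record as soon as it is complete, as the answer's program does, avoids holding everything in memory.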



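    The subroutine takes the length of the sequence and calculates the offset of the end of each third of the string. The idiom sprintf '%.0f', $value is used to round a fractional value to the nearest character position.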
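To see what the rounding idiom changes, the sketch below compares it with plain truncation by int for a few lengths. third_offsets is a hypothetical name for the calculation, and the lengths are arbitrary examples.

```perl
use strict;
use warnings;

# third_offsets is a hypothetical name: the end offset of each third,
# rounded to the nearest whole character with sprintf '%.0f'.
sub third_offsets {
    my ($length) = @_;
    return map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
}

for my $length (105, 100, 101) {
    my @rounded   = third_offsets($length);
    my @truncated = map { int($length * $_ / 3) } 1 .. 3;
    print "$length: rounded=(@rounded) truncated=(@truncated)\n";
}
```

When the length is a multiple of 3 the two agree; otherwise rounding places each boundary at the nearest character, so the thirds differ in length by at most one.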



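    For each occurrence of $regex in the sequence, the @counts array is adjusted. The element of @counts to increment is determined by comparing the start position of the match, held in $-[0], with the end offset of each third in @offsets.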
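The comparison can be isolated in a small function. which_third is a hypothetical helper, and the offsets and start positions below are chosen by hand for a 105-character sequence; this is a sketch of the idea rather than the exact loop used in the program.

```perl
use strict;
use warnings;

# which_third is a hypothetical helper: given a match start offset and the
# end offsets of the three thirds, return the index 0, 1 or 2.
sub which_third {
    my ($place, @offsets) = @_;
    for my $i (0 .. $#offsets) {
        return $i if $place < $offsets[$i];
    }
    return $#offsets;   # not reached for a valid start offset
}

my @offsets = (35, 70, 105);   # thirds of a 105-character sequence
my @counts  = (0, 0, 0);

# Hand-picked start positions on both sides of each boundary.
$counts[ which_third($_, @offsets) ]++ for (0, 34, 35, 69, 70, 104);
print "@counts\n";
```

A match is assigned to the first third whose end offset lies beyond its start, so a match that begins in one third and ends in the next is counted once, in the third where it begins.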



    After each sequence has been processed, @counts is accumulated into @totals to give the overall figures for all sequences.
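The accumulation is an element-wise sum. As a minimal sketch, here it is applied to the two sets of example figures from the question; @per_sequence is a made-up container for them.

```perl
use strict;
use warnings;

my @totals = (0, 0, 0);

# Per-sequence counts taken from the example figures in the question.
my @per_sequence = ([2, 5, 3], [4, 1, 6]);

for my $counts (@per_sequence) {
    $totals[$_] += $counts->[$_] for 0 .. 2;
}

print "Total: @totals\n";
```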



    The output of the program when run against the sample data is shown below. The totals are (9, 1, 6).

    use strict;
    use warnings;

    my $gpat = '[G]{3,5}';
    my $npat = '[A-Z]{1,25}';
    my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
    my $regex = qr/$pattern/i;

    open my $fh, '<', 'sequences.txt' or die $!;

    my ($id, $seq);
    my @totals = (0, 0, 0);

    while (<$fh>) {

      chomp;

      if (/^>(\w+)/) {
        # A new header: process the sequence accumulated so far
        process_seq($seq) if $id;
        $id = $1;
        $seq = '';
        print "$id\n";
      }
      elsif ($id) {
        $seq .= $_;
        process_seq($seq) if eof;
      }
    }

    print "Total: @totals\n";

    sub process_seq {

      my $sequence = shift;
      my $length = length $sequence;

      # End offset of each third, rounded to the nearest character
      my @offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;

      my @counts = (0, 0, 0);

      while ($sequence =~ /$regex/g) {
        my $place = $-[0];
        for my $i (0..2) {
          next if $place >= $offsets[$i];
          $counts[$i]++;
          last;
        }
      }

      print "@counts\n\n";
      $totals[$_] += $counts[$_] for 0..2;
    }

    Output

    NR_037701
    0 0 1

    NM_198399
    1 0 0

    NR_026816
    1 0 1

    NR_027917
    0 0 0

    NR_002777
    0 0 0

    NR_033769
    1 0 0

    NM_016326
    1 0 1

    NM_181641
    1 0 1

    NM_001144931
    0 0 0

    NR_029429
    0 1 0

    NR_026551
    1 0 0

    NM_181640
    1 0 1

    NM_016951
    1 0 1

    NR_002773
    1 0 0

    NR_037806
    0 0 0

    Total: 9 1 6


    I am a Perl newbie, stuck with another bioinformatics problem that requires some help and input.

    The problem in brief:

    1. I have a file, which has over 40,000 unique DNA sequences. By unique, I mean unique sequence id. I am attaching a portion of it at the end of my post to help you show what it looks like.

    2. I need to divide each of the 40,000 sequences into 3 parts. So if a particular sequence is 999 characters long, each of the 3 parts would have 333 characters.

    3. I need to look for the following pattern through each of the 3 individual parts:

      $gpat = [G]{3,5}; $npat = [A-Z]{1,25};
      $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;

    4. If $pattern appears in the first of the 3 parts, increase the counter of 'beginning', if $pattern occurs in the 2nd of the 3 parts, increase counter of 'middle' and lastly if the $pattern appears in the 3rd part, increase counter of 'end'.

    5. Print the counters of 'beginning','middle' and 'end' i.e basically summation of 'beginning','middle','end' for each of the sequences.

      Say in the 1st sequence the values are '2','5','3' respectively and in the 2nd sequence the values are '4','1','6'; the final count should then be '6,6,9'.

    The issues I am having:

    1. If a particular sequence is split into 3 parts, potential $pattern is lost. eg say on a sequence like :

    gggatgtcgatgcatggggatgcatcgatgcggggactagctagcgggatgctacgatggggatgatgataatatcgcggcgcatatatgctagtctatatatta

    a split into 3 parts produces following 3 sub-parts,each of 35 character length:

    gggatgtcgatgcatggggatgcatcgatgcgggg
    actagctagcgggatgctacgatggggatgatgat
    aatatcgcggcgcatatatgctagtctatatatta

    Hence, $pattern gets split across the first 2 parts. Is there any way to say "If $pattern begins in 1st part and ends in 2nd part", increase count of "beginning"?

    ##UPDATE## The following issue has been resolved thanks to the code suggested by Cupidvogel

    2. How do I divide a sequence into 3 parts if its length is not divisible by 3? I tried using int, but then the last part is 1-2 characters short.

    The following is the code I have written so far.

    It reads in the file and displays the header name and sequence, the length of each part, and finally the sequence split into 3 parts. This works fine provided the sequence length is divisible by 3; for sequences where it is not, the final 3rd part is 1-2 characters short.

    #Take Filename from user
    print "Please enter file name : ";
    $in =<>;
    chomp $in;
            
            
    open (FASTA,"$in") or die ;
    while (<FASTA>)
    {
    $/=">";
    @array = split '\n', $_;
    $header=shift @array; # Header of the fasta sequence
    print "\n\nNext sequence: \n";
    print $header,"\n";
                
                
    $seq= join '', @array; # sequence
    $seq=~s/\s//g;
    $seq=~s/\*//g;
    $seq=~s/>//g;
    print $seq,"\n\n";
    
    $num = int(length($seq)/3);
    @arr = unpack("A$num A$num A*",$seq);
    print " New method gives this :", @arr;
    print "\nThe first element is :", $arr[0]; 
    print "\nThe second element is :",$arr[1]; 
    print "\nThe third element is :",$arr[2] ;
    
                
                
    #The following lines of code were originally written to split...
    #...the sequence into 3 parts, albeit unsuccessfully                    
    #my $split = (length $seq)/3;
    #print $split,"\n\n";
             
    #my $int = int $split;
    #print $int,"\n\n";
             
             
    #my @array2 = $seq =~ /(.{$int})/g;
    #print join (" ", @array2),"\n\n";
            
    #print $array2[0],"\n",$array2[1],"\n",$array2[2];
            
                
    }
            
            
    exit;
    

    I have been trying the code I have written so far with the following sample file : sample.fa

    >ABC_123 2
    atgtcgatcgatcggcgggcatgcgcgcgcggatg
    atatatagcgcgcgctatatagcgcgactctacgc
    atgctgctgactagctatagtcgctgactgcgcgt
    gggaaaaagggcccgggccccgttttggggatcta
    ggggatagctgatgctagcatgcatgctgactgca
    >DEF_456 4
    gggatgtcgatgcatggggatgcatcgatgcgggg
    actagctagcgggatgctacgatggggatgatgat
    aatatcgcggcgcatatatgctagtctatatatta
    >GHI_789 1
    atagctgctagtcgatcggcgcgggtatcgatcgg
    ggatcgatcgatcggggatcgatcgggggatcgat
    

    The actual input file looks like the following:

    >NR_037701 1
    aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
    tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
    aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
    ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
    gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
    agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
    gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
    agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
    gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
    gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
    cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
    aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
    actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
    ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
    tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
    cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
    ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
    gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
    cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
    caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
    gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
    ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
    tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
    tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
    acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
    aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
    catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
    ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
    tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
    gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
    ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
    atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
    gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
    gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
    gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
    gctccctcttttaaagattttccttccctctttccaactccctgggtcct
    ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
    tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
    ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
    agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
    gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
    cacagactcaaaccctctctcacacacatacacatatacattgttattcc
    acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
    ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
    caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
    tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
    agggttgggacttcaacacagctttttgggggatcataattcaacccatg
    acagccactgagattattatatctccagagaataaatgtgtggagttaaa
    aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
    ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
    gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
    tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
    tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
    aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
    ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
    actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
    gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
    gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
    tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
    gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
    ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
    gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
    aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
    ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
    caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
    actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
    tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaa
    >NM_198399 1
    aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
    agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
    caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
    tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
    tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
    attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
    caatagcagaggaaggagggactgagcaggagacggccactccagagaac
    ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
    gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
    caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
    gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
    aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
    gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
    ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
    cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
    tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
    atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
    gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
    actgcatatttttacccttatttttgctccttacagcaagattagtaggt
    tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
    tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
    taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
    tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
    attgattaagatatcattatttttgtttggtttggttttgcttttttcct
    cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
    tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
    tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
    atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
    gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
    cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
    aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
    tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
    aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
    gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
    ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
    gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
    tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
    agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
    gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
    gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
    ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
    agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
    tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
    aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
    cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
    tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
    atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
    ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
    aaaaaaaaaaaaa
    >NR_026816 1
    caacccactctctgtgctatgacttcattactctttcccagcccagccct
    gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
    atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
    aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
    agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
    ccagagccaaggctacagctagagagttgactcctctatttgagattgac
    aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
    tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
    cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
    gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
    tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
    tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
    >NR_027917 1
    atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
    cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
    ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
    ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
    gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
    ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
    tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
    ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
    ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
    cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
    gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
    ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
    gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
    ggtgtag
    >NR_002777 3
    cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
    ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
    gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
    aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
    tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
    aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
    cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
    taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
    ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
    agcctaatagagaacataaaattctaaaagataaagataataataatgat
    aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
    tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
    aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
    gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
    cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
    tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
    tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
    tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
    gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
    ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
    cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
    cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
    aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
    cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
    ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
    actcataggacataaaaaaaaaaaaaa
    >NR_033769 1
    ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc
    agctcagccagccacagaactggaatttttcaggagcagggggagcatgg
    agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt
    aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc
    ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg
    atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta
    cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga
    acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga
    ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa
    tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca
    gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag
    tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt
    gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc
    cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg
    tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt
    ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca
    cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat
    ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg
    tggagctggtgcctccagagagccctttgatccagctcttcttggagaga
    gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt
    tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc
    tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca
    aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc
    aagttaggtttatagagtttgactagttttttcgattagatttgtattag
    ttataaatttgttcatagagtttgactaattttttcgattagatttgtat
    ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga
    ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa
    taacagctattgttttgcatatccactgcaggccaagcactttcagcatc
    atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg
    gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta
    tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc
    atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa
    aaagaaaagaaaaacagagacggtc
    >NM_016326 3
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag
    cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc
    aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt
    tttgtaattttatattactttttagtttgatactaagtattaaacatatt
    tctgtattcttccacatattttctgcagttattttaactcagtataggag
    ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta
    acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg
    gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga
    atctctttactgcctggctggccggcagctccg
    >NM_181641 2
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
    tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
    ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
    acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac
    ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag
    cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa
    aaaagaagttttgtaattttatattactttttagtttgatactaagtatt
    aaacatatttctgtattcttccacatattttctgcagttattttaactca
    gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc
    actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc
    ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg
    gatctttgaatctctttactgcctggctggccggcagctccg
    >NM_001144931 1
    gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt
    ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg
    ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg
    ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct
    gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc
    acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag
    gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc
    agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc
    atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg
    ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa
    cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact
    ttgggaggccgag
    >NR_029429 1
    ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat
    ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa
    caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct
    atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac
    actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag
    tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc
    tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg
    caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta
    cttcacctgccagccaacccgccagtattgtggagagctcatccttggag
    gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc
    cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag
    gccactggcttgtgctctgagggttgccaggccattgtggataccgagac
    cttcctgc
    >NR_026551 1
    tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc
    taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta
    ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc
    gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg
    gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg
    acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg
    gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg
    cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg
    gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt
    gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa
    agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag
    aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc
    cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga
    gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag
    gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc
    tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg
    cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc
    cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct
    gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct
    gagcagagccctcactcccaggcagagttgtctgaatccttcct
    >NM_181640 2
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg
    taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa
    accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt
    atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc
    ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg
    taattttatattactttttagtttgatactaagtattaaacatatttctg
    tattcttccacatattttctgcagttattttaactcagtataggagctag
    aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt
    ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct
    gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct
    ctttactgcctggctggccggcagctccg
    >NM_016951 3
    atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
    ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
    cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
    gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
    tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
    tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
    ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
    acttgatcgattaatgaagtggttattttggcctttgcttgatattatca
    actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg
    ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt
    gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc
    tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa
    gaagttttgtaattttatattactttttagtttgatactaagtattaaac
    atatttctgtattcttccacatattttctgcagttattttaactcagtat
    aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta
    ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca
    cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc
    tttgaatctctttactgcctggctggccggcagctccg
    >NR_002773 1
    cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca
    ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa
    ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta
    ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct
    gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca
    ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga
    gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc
    tttctgacccagcagctggggccagggctggtggatgcagcccaggccca
    gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg
    ctgcagccctggctcacttggacagggggagccccccacctgcccgggag
    gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga
    gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg
    tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc
    caagagtacctggacatagaccagatgatcttcgacagagagctgcccca
    ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga
    acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg
    gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct
    gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg
    cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc
    ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct
    gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc
    ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc
    agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg
    cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg
    agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga
    ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg
    cttgggccacttctccacgcccctgacccatggggtggactgcccctacc
    tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag
    acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct
    gcggcgacaccactcagatctctactcccactactttgggggccttgcgg
    aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat
    gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca
    caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt
    atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc
    gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag
    gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag
    ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg
    ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct
    ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg
    gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat
    cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg
    cactccatcctgcgtgactgaac
    >NR_037806 1
    attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg
    ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt
    ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac
    cctccagtgttggcacaggcccacccctggctccaccagagccagaagca
    gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc
    catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag
    agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag
    ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca
    gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg
    gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt
    gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag
    gattaaaatcacactaataacccctggatggtcaatctgataataggatc
    agatttacgtctaccctaattcttaacattgcagctttctctccatctgc
    agattattcccagtctcccagtaacacgtttctacccagatcctttttca
    tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag
    aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg
    ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa
    gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct
    tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc
    tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg
    aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa
    ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg
    ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc
    ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag
    tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc
    aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg
    agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa
    agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa
    gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt
    attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa
    atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt
    gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg
    gccctaccttgaaatccatcaggtctgcgttggacacggcattgtacatg
    ggattagctctg
    

    Any help and input would be deeply appreciated.

    Thank you for taking the time to go through my problem!

    Solution

    Rather than splitting the sequence into three parts, I would find all occurrences of $pattern in the complete sequence and determine in which third each match starts. A match that runs across a boundary is then never lost; it is simply counted in the third where it begins.

    The built-in variable $-[0] contains the offset of the start of the most recent successful match.
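As a small illustration of $-[0], the sketch below collects the start offset of every match in a short made-up string. The helper name match_starts and the test string are assumptions for this example only and are not part of the program further down.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start offset of every
# non-overlapping match of $regex in $text.
sub match_starts {
    my ($text, $regex) = @_;
    my @starts;
    while ($text =~ /$regex/g) {
        push @starts, $-[0];    # offset at which the whole match began
    }
    return @starts;
}

# Made-up test string: runs of G start at offsets 2 and 7
my @starts = match_starts('aaGGGttGGGGcc', qr/G{3,5}/i);
print "@starts\n";    # prints "2 7"
```

Because the match is run with /g in scalar context, each pass of the loop resumes where the previous match ended, so $-[0] reports a fresh offset every time.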

    The code below does what I think you want. It works by accumulating each sequence (which ends either when a new sequence ID is found or the end of file is reached) and passing it to the process_seq subroutine.
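The accumulation step can be sketched in isolation as follows. This is only an illustration under stated assumptions: the helper name read_records, the record IDs and the in-memory test data are invented here, and the sketch also strips whitespace from sequence lines, which the program below does not do.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA-style records from a filehandle and
# return a list of [id, sequence] pairs in file order.
sub read_records {
    my ($fh) = @_;
    my (@records, $id, $seq);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\w+)/) {
            # A new header closes the previous record, if there was one
            push @records, [$id, $seq] if defined $id;
            ($id, $seq) = ($1, '');
        }
        elsif (defined $id) {
            $line =~ s/\s+//g;    # extra step: drop stray whitespace
            $seq .= $line;
        }
    }
    # End of input closes the last record
    push @records, [$id, $seq] if defined $id;
    return @records;
}

# Made-up data read through an in-memory filehandle
my $data = ">NR_0001 1\nggga\ntttt\n>NM_0002 2\nacgt\n";
open my $fh, '<', \$data or die $!;
print "$_->[0]: $_->[1]\n" for read_records($fh);
```

Returning the records instead of processing them inside the loop keeps the file parsing separate from the counting, which makes each part easier to test on its own.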

    The subroutine takes the length of the sequence and calculates the offset of the end of each third of the string. The idiom sprintf '%.0f', $value is used to round fractional values to the nearest character position.
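To make the rounding concrete, here is a minimal sketch of the offset calculation on its own. The helper name third_offsets is an assumption for this example; the expression inside it is the same one the subroutine uses.

```perl
use strict;
use warnings;

# Hypothetical helper: end offset of each third of a string of $length
# characters, rounded to the nearest whole character position.
sub third_offsets {
    my ($length) = @_;
    return map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
}

print join(', ', third_offsets(100)), "\n";    # prints "33, 67, 100"
print join(', ', third_offsets(999)), "\n";    # prints "333, 666, 999"
```

For a length that is not divisible by 3 the fractional part is always one third or two thirds, never exactly one half, so the rounding is unambiguous and the three parts differ in length by at most one character.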

    The @counts array is adjusted for each occurrence of $regex in the sequence. The element of @counts to be incremented is established by comparing the starting position of the match in $-[0] with the end offset of each of the three segments of the sequence.
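The boundary case from the question can be checked with a small worked example. This is a sketch, not the full program: the helper name count_by_third and the 30-character test sequence are made up for illustration, and the match in it deliberately runs from the first third into the last.

```perl
use strict;
use warnings;

# Hypothetical helper: count matches of $regex in $sequence according
# to the third of the sequence in which each match starts.
sub count_by_third {
    my ($sequence, $regex) = @_;
    my $length  = length $sequence;
    my @offsets = map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
    my @counts  = (0, 0, 0);

    while ($sequence =~ /$regex/g) {
        my $place = $-[0];
        for my $i (0 .. 2) {
            next if $place >= $offsets[$i];
            $counts[$i]++;
            last;
        }
    }
    return @counts;
}

my $gpat    = '[G]{3,5}';
my $npat    = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex   = qr/$pattern/i;

# Made-up 30-character sequence: thirds end at offsets 10, 20 and 30.
# The match starts at offset 8 and ends inside the final third.
my $sequence = ('t' x 8) . 'gggagggagggaggg' . ('t' x 7);
my @counts   = count_by_third($sequence, $regex);
print "@counts\n";    # prints "1 0 0"
```

The match is counted once, in the beginning third, even though most of it lies in the middle and end thirds; splitting the string first would have missed it entirely.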

    Once each sequence has been processed the values in @counts are accumulated into @totals to give overall figures for all sequences.

    The output of the program when using your sample data is shown. The grand total is (9, 1, 6).

    use strict;
    use warnings;
    
    my $gpat = '[G]{3,5}';
    my $npat = '[A-Z]{1,25}';
    my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat; 
    my $regex = qr/$pattern/i;
    
    open my $fh, '<', 'sequences.txt' or die $!;
    
    my ($id, $seq);
    my @totals = (0, 0, 0);
    
    while (<$fh>) {
    
      chomp;
    
      if (/^>(\w+)/) {
        process_seq($seq) if $id;
        $id = $1;
        $seq = '';
        print "$id\n";
      }
      elsif ($id) {
        $seq .= $_;
        process_seq($seq) if eof;
      }
    }
    
    print "Total: @totals\n";
    
    
    
    sub process_seq {
    
      my $sequence = shift;
      my $length = length $sequence;
    
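      # End offset of each third, rounded to the nearest character position
      my @offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;
    
      my @counts = (0, 0, 0);
    
      while ($sequence =~ /$regex/g) {
        my $place = $-[0];   # start offset of this match
        # Count the match in the first third whose end lies beyond its start
        for my $i (0..2) {
          next if $place >= $offsets[$i];
          $counts[$i]++;
          last;
        }
      }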
    
      print "@counts\n\n";
      $totals[$_] += $counts[$_] for 0..2;
    }
    

    output

    NR_037701
    0 0 1
    
    NM_198399
    1 0 0
    
    NR_026816
    1 0 1
    
    NR_027917
    0 0 0
    
    NR_002777
    0 0 0
    
    NR_033769
    1 0 0
    
    NM_016326
    1 0 1
    
    NM_181641
    1 0 1
    
    NM_001144931
    0 0 0
    
    NR_029429
    0 1 0
    
    NR_026551
    1 0 0
    
    NM_181640
    1 0 1
    
    NM_016951
    1 0 1
    
    NR_002773
    1 0 0
    
    NR_037806
    0 0 0
    
    Total: 9 1 6
    
