Perl6: What is the best way for dealing with very big files?


Question

Last week I decided to give Perl6 a try and started to reimplement one of my programs. I have to say, Perl6 makes object-oriented programming so easy, an aspect that was very painful for me in Perl5.

My program has to read and store big files, such as whole genomes (up to 3 Gb and more, see Example 1 below) or tabulated data.

The first version of the code was written the Perl5 way, iterating line by line ("genome.fa".IO.lines). It was very slow, with an unacceptable execution time.

my class fasta {
  has Str $.file is required;
  has %!seq;

  submethod TWEAK() {
    my $id;
    my $s;

    # Iterate the file line by line, Perl5-style.
    for $!file.IO.lines -> $line {
      if $line ~~ /^\>/ {                 # header line, e.g. ">2L"
        say $id;                          # progress output
        if $id.defined {
          %!seq{$id} = seq.new(id => $id, seq => $s);
        }
        my $l = $line;
        $l ~~ s:g/^\>//;                  # strip the leading '>'
        $id = $l;
        $s = "";
      }
      else {
        $s ~= $line;                      # accumulate sequence lines
      }
    }
    %!seq{$id} = seq.new(id => $id, seq => $s);   # store the last record
  }
}


sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}
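
These snippets build objects of a small sequence-holder class, seq, which the post itself does not show. A minimal sketch of it, assuming it only stores an id and the concatenated sequence string:

# Assumed sequence-holder class (not shown in the original post).
my class seq {
  has Str $.id  is required;    # sequence name, e.g. "2L"
  has Str $.seq is required;    # concatenated nucleotide string
}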

So after a little bit of RTFM, I switched to slurping the file and splitting it on \n, then parsing the lines with a for loop. This way I managed to load the data in 2 min. Much better, but not enough. By cheating, I mean by removing as many \n as possible (Example 2), I decreased the execution time to 30 seconds. Quite good, but not totally satisfying, as this fasta format is not the most commonly used.

my class fasta {
  has Str $.file is required;
  has %!seq;

  submethod TWEAK() {
    my $id;
    my $s;

    say "Slurping ...";
    my $f = $!file.IO.slurp;

    say "Splitting file ...";
    my @lines = $f.split(/\n/);

    say "Parsing lines ...";
    for @lines -> $line {
      if $line !~~ /^\>/ {
          $s ~= $line;                    # sequence line: accumulate
      }
      else {
        say $id;                          # progress output
        if $id.defined {
          %!seq{$id} = seq.new(id => $id, seq => $s);
        }
        $id = $line;
        $id ~~ s:g/^\>//;                 # strip the leading '>' from the header
        $s = "";
      }
    }
    %!seq{$id} = seq.new(id => $id, seq => $s);   # store the last record
  }
}

sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}

So I RTFM'd again and discovered the magic of grammars. The new version runs in 45 seconds whatever fasta format is used. Not the fastest way, but more elegant and stable.

my grammar fastaGrammar {
  token TOP { <fasta>+ }

  token fasta   { <.ws><header><seq> }
  token header  { <sup><id>\n }
  token sup     { '>' }                    # record marker
  token id      { <[\d\w]>+ }              # sequence name, e.g. "2L"
  token seq     { [<[ACGTNacgtn]>+\n]+ }   # one or more nucleotide lines
}

my class fastaActions {
  method TOP ($/) {
    my @seqArray;

    for $<fasta> -> $f {
      @seqArray.push: seq.new(id => $f.<header><id>.made, seq => $f<seq>.made);
    }
    make @seqArray;
  }

  method fasta ($/) { make ~$/; }
  method id    ($/) { make ~$/; }
  method seq   ($/) { make $/.subst("\n", "", :g); }   # drop the embedded newlines
}

my class fasta {
  has Str $.file is required;
  has %!seq;

  submethod TWEAK() {

    say "=> Slurping ...";
    my $f = $!file.IO.slurp;

    say "=> Grammaring ...";
    my @seqArray = fastaGrammar.parse($f, actions => fastaActions).made;

    say "=> Storing data ...";
    for @seqArray -> $s {
      %!seq{$s.id} = $s;
    }
  }
}

sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}

I think I found a good solution to handle this kind of big file, but performance is still below that of Perl5.

As a newbie in Perl6, I would be interested to know if there are better ways to deal with big data, or if there are limitations due to the Perl6 implementation.

More precisely, I would ask two questions:

  • Are there other Perl6 mechanisms, that I'm not yet aware of or that are not yet documented, for storing huge data from a file (like my genomes)?
  • Have I reached the maximum performance of the current version of Perl6?

Thanks for reading!


Fasta Example 1:

>2L
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...
>3R
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...

Fasta Example 2:

>2L
GACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT...            
>3R
TAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCT...

EDIT: I applied the advice of @Christoph and @timotimo and tested with this code:

my class fasta {
  has Str $.file is required;
  has %!seq;

  submethod TWEAK() {
    say "=> Slurping / Parsing / Storing ...";
    # .cache lets the Seq returned by .split("\n") be consumed twice
    # (once by .head for the id, once by .skip for the sequence lines).
    %!seq = slurp($!file, :enc<latin1>).split('>').skip(1).map: {
      .head => seq.new(id => .head, seq => .skip(1).join) given .split("\n").cache;
    }
  }
}


sub MAIN()
{
    my $f = fasta.new(file => "genome.fa");
}

The program finished in 2.7s, which is great! I also tried this code on the wheat genome (10 Gb); it finished in 35.2s. Finally, Perl6 is not so slow!
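
For reference, a sketch of how such timings can be reproduced from within the program itself (illustrative only, not part of the original measurement):

my $t0 = now;                                   # wall-clock start
my $f  = fasta.new(file => "genome.fa");
say "Elapsed: {(now - $t0).fmt('%.2f')} s";     # elapsed wall-clock time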

Big thanks for the help!

Solution

One simple improvement is to use a fixed-width encoding such as latin1 to speed up character decoding, though I'm not sure how much this will help.

As far as Rakudo's regex/grammar engine is concerned, I've found it to be pretty slow, so it might indeed be necessary to take a more low-level approach.

I did not do any benchmarking, but what I'd try first is something like this:

# [1..*] drops the empty element in front of the first '>'.
my %seqs = slurp('genome.fa', :enc<latin1>).split('>')[1..*].map: {
    .[0] => .[1..*].join given .split("\n");    # header => joined sequence lines
}
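
Given Example 1 above, the resulting %seqs maps each header to its concatenated sequence, so individual entries can be looked up by name; for instance (illustrative only):

say %seqs<2L>.chars;            # total length of sequence 2L
say %seqs<3R>.substr(0, 60);    # first 60 bases of 3R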

As the Perl6 standard library is implemented in Perl6 itself, it is sometimes possible to improve performance by just avoiding it, writing code in an imperative style such as this:

my %seqs;
my $data = slurp('genome.fa', :enc<latin1>);
my $pos = 0;
loop {
    $pos = $data.index('>', $pos) // last;   # next record marker, or done

    my $ks = $pos + 1;                       # header (key) start...
    my $ke = $data.index("\n", $ks);         # ...and end

    my $ss = $ke + 1;                        # sequence start...
    my $se = $data.index('>', $ss) // $data.chars;   # ...and end (next '>' or EOF)

    my @lines;

    $pos = $ss;
    while $pos < $se {
        my $end = $data.index("\n", $pos);
        @lines.push($data.substr($pos..^$end));   # collect one sequence line
        $pos = $end + 1
    }

    %seqs{$data.substr($ks..^$ke)} = @lines.join;
}

However, if the parts of the standard library used have seen some performance work, this might actually make things worse. In that case, the next step to take would be adding low-level type annotations such as str and int and replacing calls to routines such as .index with NQP builtins such as nqp::index.
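
For illustration, here is a sketch (not code from the answer) of the first half of that step: the same loop with native str and int annotations on the hot variables, before any calls are replaced with NQP builtins:

my %seqs;
my str $data = slurp('genome.fa', :enc<latin1>);
my int $pos  = 0;

loop {
    $pos = $data.index('>', $pos) // last;

    my int $ks = $pos + 1;                   # header start
    my int $ke = $data.index("\n", $ks);     # header end

    my int $ss = $ke + 1;                    # sequence start
    my int $se = $data.index('>', $ss) // $data.chars;

    my @lines;

    $pos = $ss;
    while $pos < $se {
        my int $end = $data.index("\n", $pos);
        @lines.push($data.substr($pos, $end - $pos));
        $pos = $end + 1;
    }

    %seqs{$data.substr($ks, $ke - $ks)} = @lines.join;
}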

If that's still too slow, you're out of luck and will need to switch languages, e.g. calling into Perl5 by using Inline::Perl5, or into C using NativeCall.
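
As a hedged sketch of the Inline::Perl5 route (untested; it assumes Inline::Perl5's usual marshaling of a returned Perl 5 hashref back into a Perl6 hash):

use Inline::Perl5;

my $p5 = Inline::Perl5.new;

# Run plain Perl 5 code and take back the resulting hashref.
my %seqs = $p5.run(q:to/PERL5/);
    my %seqs;
    open my $fh, '<', 'genome.fa' or die $!;
    my $id;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ s/^>//) { $id = $line }
        else                 { $seqs{$id} .= $line }
    }
    \%seqs;
PERL5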


Note that @timotimo has done some performance measurements and has written an article about it.

If my short version is the baseline, the imperative version improves performance by 2.4x.

He actually managed to squeeze a 3x improvement out of the short version by rewriting it to:

my %seqs = slurp('genome.fa', :enc<latin-1>).split('>').skip(1).map: {
    # .cache lets the Seq from .split("\n") be consumed twice (.head and .skip).
    .head => .skip(1).join given .split("\n").cache;
}

Finally, rewriting the imperative version using NQP builtins sped things up by a factor of 17x. Given potential portability issues, writing such code is generally discouraged, but it may be necessary for now if you really need that level of performance:

use nqp;

my Mu $seqs := nqp::hash();
my str $data = slurp('genome.fa', :enc<latin1>);
my int $pos = 0;

my str @lines;    # reusable native-str buffer for the current record's lines

loop {
    $pos = nqp::index($data, '>', $pos);

    last if $pos < 0;

    my int $ks = $pos + 1;
    my int $ke = nqp::index($data, "\n", $ks);

    my int $ss = $ke + 1;
    my int $se = nqp::index($data, '>', $ss);

    if $se < 0 {
        $se = nqp::chars($data);
    }

    $pos = $ss;
    my int $end;

    while $pos < $se {
        $end = nqp::index($data, "\n", $pos);
        nqp::push_s(@lines, nqp::substr($data, $pos, $end - $pos));
        $pos = $end + 1
    }

    # Join the collected lines, bind them under the header key, then clear the buffer.
    nqp::bindkey($seqs, nqp::substr($data, $ks, $ke - $ks), nqp::join("", @lines));
    nqp::setelems(@lines, 0);
}
