"average length of the sequences in a fasta file": Can you improve this Erlang code?

Views: 184 | Published: 2017/8/27 12:23:11 | Tags: string, erlang, sequence, bioinformatics, mean

Problem description

I'm trying to get the mean length of FASTA sequences using Erlang. A FASTA file looks like this:
>title1
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCTCGTACGC
>title2
ATCGATCGCATCGATGCTACGATCTCGTACGC
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
ATCGATCGCATCGATGCTACGATCGATCATATA
ATGACTAGCTAGCAGCGATCGACCGTCGTACGC
>title3
ATCGATCGCATCGAT(...)
I tried to answer this question using the following Erlang code:
-module(golf).
-export([test/0]).

line([],{Sequences,Total}) ->  {Sequences,Total};
line(">" ++ Rest,{Sequences,Total}) -> {Sequences+1,Total};
line(L,{Sequences,Total}) -> {Sequences,Total+string:len(string:strip(L))}.

scanLines(S,Sequences,Total)->
        case io:get_line(S,'') of
            eof -> {Sequences,Total};
            {error,_} ->{Sequences,Total};
            Line -> {S2,T2}=line(Line,{Sequences,Total}), scanLines(S,S2,T2)
        end  .

test()->
    {Sequences,Total}=scanLines(standard_io,0,0),
    io:format("~p\n",[Total/(1.0*Sequences)]),
    halt().
Compilation/Execution:
erlc golf.erl
erl -noshell -s golf test < sequence.fasta
563.16
This code seems to work fine for a small FASTA file, but it takes hours to parse a larger one (>100 MB). Why? I'm an Erlang newbie; can you please improve this code?
Solution
If you need really fast IO, you have to do a little more trickery than usual.
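-module(g).
-export([s/0]).

%% Entry point: read stdin through a port, print the mean, then stop the VM.
s()->
  %% fd 0 is stdin; data arrives as binaries, one message per line
  %% (lines longer than 256 bytes are split into noeol chunks).
  P = open_port({fd, 0, 1}, [in, binary, {line, 256}]),
  r(P, 0, 0),
  halt().

%% C counts header lines, L sums the bytes of all other lines.
r(P, C, L) ->
  receive
    %% Header line: starts with ">".
    {P, {data, {eol, <<$>:8, _/binary>>}}} ->
      r(P, C+1, L);
    %% Sequence line: the newline is already removed by the port.
    {P, {data, {eol, Line}}} ->
      r(P, C, L + size(Line));
    %% Port closed at end of input: print the mean length.
    {'EXIT', P, normal} ->
      io:format("~p~n",[L/C])
  end.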
It is the fastest IO I know of, but note the -noshell -noinput flags.
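One detail of the `{line, 256}` port option is worth spelling out: a line longer than 256 bytes is delivered in several messages, all but the last tagged `noeol`, and the loop above has no clause for those. The following is only a sketch of how the counting step could track such chunks and also avoid dividing by zero on empty input. The module name `fasta_fold` and the functions `step/2`, `mean/1` and `demo/0` are hypothetical and not part of the original answer.

```erlang
-module(fasta_fold).
-export([step/2, mean/1, demo/0]).

%% State is {Count, Length, Pending}:
%%   Count   - number of header lines seen so far
%%   Length  - total bytes of sequence data seen so far
%%   Pending - none | header | seq, the kind of a long line whose
%%             earlier chunks arrived tagged noeol
step({Flag, Chunk}, {C, L, Pending}) ->
    Kind = case Pending of
               none -> kind(Chunk);   % first chunk decides the kind
               _    -> Pending        % continuation of a long line
           end,
    %% A header is counted once, on its first chunk only.
    C1 = case {Pending, Kind} of
             {none, header} -> C + 1;
             _              -> C
         end,
    L1 = case Kind of
             seq    -> L + byte_size(Chunk);
             header -> L
         end,
    Next = case Flag of
               eol   -> none;
               noeol -> Kind
           end,
    {C1, L1, Next}.

kind(<<$>:8, _/binary>>) -> header;
kind(_)                  -> seq.

%% No headers seen: there is no mean to report.
mean({0, _, _}) -> undefined;
mean({C, L, _}) -> L / C.

%% Feeds hand-written port events through step/2.
demo() ->
    Events = [{eol,   <<">title1">>},
              {eol,   <<"ATGACT">>},
              {noeol, <<">a very long ">>},
              {eol,   <<"header">>},
              {noeol, <<"ATCG">>},
              {eol,   <<"AT">>}],
    mean(lists:foldl(fun step/2, {0, 0, none}, Events)).
```

In a receive loop like the one above, a single clause of the form `{P, {data, Event}}` could pass `Event` to `step/2` and thread the state tuple through in place of `C` and `L`. Whether the extra bookkeeping costs measurable time was not tested here.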
Compile as with erlc +native +"{hipe, [o3]}" g.erl, but with -smp disable:
erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files g.erl
and run:
time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s < uniprot_sprot.fasta
352.6697028442464

real    0m3.241s
user    0m3.060s
sys     0m0.124s
With -smp enabled but native code, it takes:
$ erlc +native +"{hipe, [o3]}" g.erl
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m5.103s
user    0m4.944s
sys     0m0.112s
Byte code but with -smp disable (almost on par with native, because most of the work is done in the port!):
$ erlc g.erl
$ time erl -smp disable -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta
352.6697028442464

real    0m3.565s
user    0m3.436s
sys     0m0.104s
Just for completeness, byte code with SMP:
$ time erl -noshell -mode minimal -boot start_clean -noinput -s g s<uniprot_sprot.fasta 
352.6697028442464

real    0m5.433s
user    0m5.236s
sys     0m0.128s
For comparison, sarnold's version gives me a wrong answer and takes longer on the same hardware:
$ erl -smp disable -noinput -mode minimal -boot start_clean -s erl_compile compile_cmdline @cwd /home/hynek/Download @option native @option '{hipe, [o3]}' @files golf.erl
./golf.erl:5: Warning: variable 'Rest' is unused
$ time erl -smp disable -noshell -mode minimal -s golf test
359.04679841439776

real    0m17.569s
user    0m16.749s
sys     0m0.664s
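The compared version is not reproduced here, so the cause of the different result is not stated. One plausible contributor, assuming it measures lines the way the question's `line/2` does, is that `string:strip/1` removes only blanks and leaves the newline that `io:get_line/2` returns, so every sequence line is counted one character too long. The small check below illustrates this; the module name `strip_check` is hypothetical, and `string:trim/1` requires OTP 20 or later.

```erlang
-module(strip_check).
-export([lengths/0]).

%% Measures one line as io:get_line/2 would return it, i.e. with the
%% trailing newline still attached, in two different ways.
lengths() ->
    Line = "ATGACT\n",
    %% string:strip/1 removes leading and trailing blanks only,
    %% so the newline is still counted.
    WithStrip = string:len(string:strip(Line)),
    %% string:trim/1 removes surrounding whitespace, newline included.
    WithTrim = length(string:trim(Line)),
    {WithStrip, WithTrim}.
```

Calling `strip_check:lengths()` returns `{7, 6}`: the six bases plus the newline in the first case, the six bases alone in the second.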
EDIT: I have looked at the characteristics of uniprot_sprot.fasta and I'm a little surprised. It has 3824397 rows and is 232 MB. That means the -smp disable version can handle 1.18 million text lines per second (71 MB/s of line-oriented IO).
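The throughput figures follow directly from the row count, the file size and the 3.241 s wall-clock time of the native, -smp disable run reported earlier. A minimal sketch of the arithmetic, with the hypothetical module name `throughput`:

```erlang
-module(throughput).
-export([rates/0]).

%% Returns {LinesPerSecond, MegabytesPerSecond} for the fastest run.
rates() ->
    Lines   = 3824397,  % rows in uniprot_sprot.fasta, from the text
    SizeMB  = 232,      % file size in MB, from the text
    Seconds = 3.241,    % "real" time of the native, -smp disable run
    {Lines / Seconds, SizeMB / Seconds}.
```

The first element comes out at roughly 1.18 million lines per second and the second at a little over 71 MB/s, matching the numbers quoted in the edit.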
