Improving clojure lazy-seq usage for iterative text parsing


Question

I'm writing a Clojure implementation of this coding challenge, attempting to find the average length of sequence records in Fasta format:

>1
GATCGA
GTC
>2
GCA
>3
AAAAA

For more background see this related StackOverflow post about an Erlang solution.

My beginner Clojure attempt uses lazy-seq to attempt to read in the file one record at a time so it will scale to large files. However it is fairly memory hungry and slow, so I suspect that it's not implemented optimally. Here is a solution using the BioJava library to abstract out the parsing of the records:

(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader."
  [seq-iter]
  (lazy-seq
    (when (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths
  "Use BioJava to read a Fasta input file as a StreamReader of sequences."
  [in-file seq-type]
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))

and an equivalent approach without external libraries:

(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths
  "Retrieve lengths of sequences in the file using line lengths."
  [lines cur-length]
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (cond
        (nil? cur-line) [cur-length]
        (= \> (first cur-line)) (cons cur-length (seq-lengths remain-lines 0))
        :else (seq-lengths remain-lines (+ cur-length (.length cur-line)))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ; seq-type is unused here; it is kept so the command-line interface
  ; matches the BioJava version. Drop the first item since it covers
  ; everything up to the first >.
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))
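As a quick sanity check (my own, not part of the original post), `seq-lengths` can be fed the sample records from the top of the question as a plain vector of lines:

```clojure
;; Sanity check (not in the original post). The leading 0 accumulator
;; produces a spurious first element for everything before the first >,
;; which `rest` drops.
(let [lines [">1" "GATCGA" "GTC" ">2" "GCA" ">3" "AAAAA"]]
  (rest (seq-lengths lines 0)))
;; → (9 3 5), so the average record length is 17/3
```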

The current implementation takes 44 seconds on a large file compared to 7 seconds for a Python implementation. Can you offer any suggestions on speeding the code up and making it more intuitive? Is the usage of lazy-seq correctly parsing the file record by record as intended?
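One direction worth trying (my own sketch, untested on the benchmark file; `avg-record-length` is a hypothetical name, not from the post) is to drop laziness entirely and accumulate the running sum and record count in a single eager pass with loop/recur, so no lazy-seq node is allocated per line:

```clojure
;; Sketch: one eager pass over a collection of lines, accumulating the
;; total residue length and the number of header lines (= records).
(defn avg-record-length [lines]
  (loop [ls (seq lines) sum 0 cnt 0]
    (if-let [[line & more] ls]
      (if (= \> (first line))
        (recur more sum (inc cnt))               ; header starts a record
        (recur more (+ sum (.length line)) cnt)) ; sequence line
      (if (zero? cnt) 0 (/ sum cnt)))))

;; e.g. (avg-record-length (read-lines in-file)) with duck-streams'
;; read-lines, or line-seq over a reader.
```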

Answer

It probably doesn't matter, but average is holding onto the head of the seq of lengths.
The following is a wholly untested, but lazier way to do what I think you want.

(use 'clojure.java.io) ; clojure.java.io is available since Clojure 1.2

(defn lazy-avg [coll]
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(defn fasta-avg [f]
  (->> (reader f)
       line-seq
       (remove #(.startsWith % ">"))  ; drop header lines
       (map #(.length %))             ; note: this averages per line,
       lazy-avg))                     ; not per record, if a record wraps
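On newer Clojure (1.7+), the same single-pass pipeline can also be expressed with a transducer; this is my own sketch, not part of the original answer, and `xf-avg` is a hypothetical name. It keeps the answer's per-line semantics: for the sample records the sequence lines have lengths 6, 3, 3 and 5, giving 17/4.

```clojure
;; Sketch (assumes Clojure 1.7+ for transducers): sum line lengths and
;; count lines in one pass over a collection of lines.
(defn xf-avg [lines]
  (let [[sum cnt] (transduce
                    (comp (remove #(.startsWith % ">"))
                          (map #(.length %)))
                    (fn
                      ([acc] acc)                        ; completion arity
                      ([[s c] len] [(+ s len) (inc c)])) ; step arity
                    [0 0]
                    lines)]
    (if (zero? cnt) 0 (/ sum cnt))))
;; (xf-avg [">1" "GATCGA" "GTC" ">2" "GCA" ">3" "AAAAA"]) → 17/4
```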
