Improving clojure lazy-seq usage for iterative text parsing


Question

I'm writing a Clojure implementation of this coding challenge, attempting to find the average length of sequence records in Fasta format:

>1
GATCGA
GTC
>2
GCA
>3
AAAAA

For more background see this related StackOverflow post about an Erlang solution.

My beginner Clojure attempt uses lazy-seq to attempt to read in the file one record at a time so it will scale to large files. However it is fairly memory hungry and slow, so I suspect that it's not implemented optimally. Here is a solution using the BioJava library to abstract out the parsing of the records:

(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader."
  [seq-iter]
  (lazy-seq
    (when (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths
  "Use BioJava to read a Fasta input file as a StreamReader of sequences."
  [in-file seq-type]
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))

and an equivalent approach without external libraries:

(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths
  "Retrieve lengths of sequences in the file using line lengths."
  [lines cur-length]
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (cond
        (nil? cur-line) [cur-length]
        (= \> (first cur-line)) (cons cur-length (seq-lengths remain-lines 0))
        :else (seq-lengths remain-lines (+ cur-length (.length cur-line)))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ; seq-type is unused here; it is kept so the command-line interface
  ; matches the BioJava version. Drop the first item since it covers
  ; everything up to the first >.
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))
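As a quick sanity check (my own, not part of the original post), `seq-lengths` can be fed the sample records from the top of the question as a plain vector of lines:

```clojure
;; Sanity check (not in the original post). The leading 0 accumulator
;; produces a spurious first element for everything before the first >,
;; which `rest` drops.
(let [lines [">1" "GATCGA" "GTC" ">2" "GCA" ">3" "AAAAA"]]
  (rest (seq-lengths lines 0)))
;; → (9 3 5), so the average record length is 17/3
```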

The current implementation takes 44 seconds on a large file compared to 7 seconds for a Python implementation. Can you offer any suggestions on speeding the code up and making it more intuitive? Is the usage of lazy-seq correctly parsing the file record by record as intended?
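One direction worth trying (my own sketch, untested on the benchmark file; `avg-record-length` is a hypothetical name, not from the post) is to drop laziness entirely and accumulate the running sum and record count in a single eager pass with loop/recur, so no lazy-seq node is allocated per line:

```clojure
;; Sketch: one eager pass over a collection of lines, accumulating the
;; total residue length and the number of header lines (= records).
(defn avg-record-length [lines]
  (loop [ls (seq lines) sum 0 cnt 0]
    (if-let [[line & more] ls]
      (if (= \> (first line))
        (recur more sum (inc cnt))               ; header starts a record
        (recur more (+ sum (.length line)) cnt)) ; sequence line
      (if (zero? cnt) 0 (/ sum cnt)))))

;; e.g. (avg-record-length (read-lines in-file)) with duck-streams'
;; read-lines, or line-seq over a reader.
```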

Answer

It probably doesn't matter, but average is holding onto the head of the seq of lengths.
The following is a wholly untested, but lazier way to do what I think you want.

(use 'clojure.java.io) ; clojure.java.io is available since Clojure 1.2

(defn lazy-avg [coll]
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(defn fasta-avg [f]
  (->> (reader f)
       line-seq
       (remove #(.startsWith % ">"))  ; drop header lines
       (map #(.length %))             ; note: this averages per line,
       lazy-avg))                     ; not per record, if a record wraps
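On newer Clojure (1.7+), the same single-pass pipeline can also be expressed with a transducer; this is my own sketch, not part of the original answer, and `xf-avg` is a hypothetical name. It keeps the answer's per-line semantics: for the sample records the sequence lines have lengths 6, 3, 3 and 5, giving 17/4.

```clojure
;; Sketch (assumes Clojure 1.7+ for transducers): sum line lengths and
;; count lines in one pass over a collection of lines.
(defn xf-avg [lines]
  (let [[sum cnt] (transduce
                    (comp (remove #(.startsWith % ">"))
                          (map #(.length %)))
                    (fn
                      ([acc] acc)                        ; completion arity
                      ([[s c] len] [(+ s len) (inc c)])) ; step arity
                    [0 0]
                    lines)]
    (if (zero? cnt) 0 (/ sum cnt))))
;; (xf-avg [">1" "GATCGA" "GTC" ">2" "GCA" ">3" "AAAAA"]) → 17/4
```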
