How to process large binary data in Clojure?

Question

How does one process large binary data files in Clojure? Let's assume data/files are about 50MB - small enough to be processed in memory (but not with a naive implementation).

The following code correctly removes ^M from small files but it throws OutOfMemoryError for larger files (like 6MB):

(defn read-bin-file [file]
  (to-byte-array (as-file file)))

(defn remove-cr-from-file [file]
  (let [dirty-bytes (read-bin-file file)
        clean-bytes (filter #(not (= 13 %)) dirty-bytes)
        changed?    (< (count clean-bytes) (alength dirty-bytes))]    ; OutOfMemoryError
    (if changed?
      (write-bin-file file clean-bytes))))    ; writing works fine

It seems that Java byte arrays can't be treated as a seq, because doing so is extremely inefficient.

On the other hand, solutions with aset, aget and areduce are bloated, ugly and imperative, because you can't really use the Clojure sequence library.

What am I missing? How does one process large binary data files in Clojure?

Recommended answer

I would probably personally use aget / aset / areduce here - they may be imperative but they are useful tools when dealing with arrays, and I don't find them particularly ugly. If you want to wrap them in a nice function then of course you can :-)
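As a minimal sketch of what such a wrapper might look like (the names `count-cr` and `remove-cr` are hypothetical and do not come from the question), the array functions can be hidden behind two small functions that work directly on the byte array and never build a seq:

```clojure
(defn count-cr
  "Counts the CR bytes (value 13) in the byte array `src` using areduce."
  [^bytes src]
  (areduce src i n 0
           (if (== 13 (aget src i)) (inc n) n)))

(defn remove-cr
  "Returns a new byte array holding the bytes of `src` with every CR removed."
  [^bytes src]
  (let [n          (alength src)
        ^bytes dst (byte-array n)]          ; worst case: nothing is removed
    (loop [i 0, j 0]
      (if (< i n)
        (let [b (aget src i)]
          (if (== b 13)
            (recur (inc i) j)               ; skip the CR
            (do (aset dst j b)              ; keep every other byte
                (recur (inc i) (inc j)))))
        ;; trim the destination to the number of bytes actually kept
        (java.util.Arrays/copyOf dst (int j))))))
```

With these, `(pos? (count-cr dirty-bytes))` would answer the "changed?" question and `(remove-cr dirty-bytes)` would produce the cleaned array. Memory use stays at roughly twice the file size, since only the source and destination arrays exist.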

If you are determined to use sequences, then your problem will be in the construction and traversal of the seq, since this requires creating and storing a new seq object for every byte in the array. That is probably ~24 bytes for each array byte.
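To put that estimate in numbers, here is a rough back-of-the-envelope calculation. It takes the ~24 bytes per element figure at face value, which is an approximation rather than a measured value, and `seq-overhead-bytes` is a hypothetical helper name:

```clojure
(defn seq-overhead-bytes
  "Rough memory needed to keep a fully realized seq over `n` array bytes,
  assuming about 24 bytes of seq object per element."
  [n]
  (* 24 n))

;; A 6 MB file: 6 * 1024 * 1024 = 6291456 array bytes
(seq-overhead-bytes (* 6 1024 1024)) ; => 150994944 bytes, i.e. 144 MiB
```

Under this assumption, a fully realized seq over a 50 MB file would need on the order of 1.2 GB, which suggests why the seq has to be consumed lazily or avoided altogether.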

So the trick is to get it to work lazily, in which case the earlier objects will be garbage collected before you get to the end of the array. However, to make this work, you'll have to avoid holding any reference to the head of the seq while you traverse the sequence (e.g. with count).
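The difference between holding and not holding the head can be sketched as follows. Both function names are hypothetical, and the second function exists only to show the pattern to avoid:

```clojure
(defn count-kept
  "Counts the elements of `coll` that are not 13.
  The lazy seq goes straight into `count` and is not bound to a local
  that is used again, so already-counted nodes can be collected."
  [coll]
  (count (filter #(not= 13 %) coll)))

(defn count-kept-holding-head
  "Same count, but `kept` is used again after `count` has walked it,
  so the whole realized seq has to stay in memory until then."
  [coll]
  (let [kept (filter #(not= 13 %) coll)
        n    (count kept)]
    [n (first kept)]))
```

The code in the question follows the second pattern: `clean-bytes` is counted and then passed to `write-bin-file`, so the fully realized seq is still reachable while it is being counted.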

The following might work (untested), but will depend on write-bin-file being implemented in a lazy-friendly manner:

(defn remove-cr-from-file [file]
  (let [dirty-bytes (read-bin-file file)
        clean-bytes (filter #(not (= 13 %)) dirty-bytes)
        changed-bytes (count (filter #(not (= 13 %)) dirty-bytes))
        changed?    (< changed-bytes (alength dirty-bytes))]   
    (if changed?
      (write-bin-file file clean-bytes))))

Note this is essentially the same as your code, but constructs a separate lazy sequence to count the number of changed bytes.
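The question does not show how `write-bin-file` is implemented, so the following is only a sketch of what a lazy-friendly version might look like, assuming `clojure.java.io` is available. It consumes the seq one element at a time and writes through a buffered stream instead of first converting the whole seq into an array:

```clojure
(require '[clojure.java.io :as io])

(defn write-bin-file
  "Writes the byte values in `byte-seq` to `file`, one element at a time.
  Hypothetical implementation; the original one is not shown in the question."
  [file byte-seq]
  (with-open [out (io/output-stream file)]   ; buffered output stream
    (doseq [b byte-seq]
      ;; OutputStream.write(int) writes only the low 8 bits of its argument
      (.write ^java.io.OutputStream out (int b)))))
```

Whether memory stays bounded in practice also depends on nothing else keeping a reference to the head of `byte-seq` while it is written, so this is worth verifying against a large file rather than assuming.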
