OutOfMemoryError when parsing XML in Clojure with data.zip
Problem description
I want to use Clojure to extract the titles from a Wiktionary XML dump.
I used head -n10000 > out-10000.xml to create smaller versions of the original monster file. Then I trimmed with a text editor to make it valid XML. I renamed the files according to the number of lines inside (wc -l):
(def data-9764 "data/wiktionary-en-9764.xml") ; 354K
(def data-99224 "data/wiktionary-en-99224.xml") ; 4.1M
(def data-995066 "data/wiktionary-en-995066.xml") ; 34M
(def data-7999931 "data/wiktionary-en-7999931.xml") ; 222M
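The truncation workflow above can be sketched in the shell. A tiny stand-in file is generated here, since the real dump is hundreds of MB; note that head simply cuts the file off mid-document, which is why the hand-editing step is needed before the result is valid XML again:

```shell
# Create a tiny stand-in for the multi-GB Wiktionary dump.
printf '%s\n' '<mediawiki>' \
  '<page><title>dictionary</title></page>' \
  '<page><title>lexicon</title></page>' \
  '</mediawiki>' > dump.xml

# Keep only the first N lines; the closing tags are lost,
# so the result is NOT valid XML until they are restored by hand.
head -n 3 dump.xml > out-3.xml

# Name the trimmed file after its line count (wc -l).
wc -l < out-3.xml    # prints 3
```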
Here is the overview of the XML structure:
<mediawiki>
  <page>
    <title>dictionary</title>
    <revision>
      <id>20100608</id>
      <parentid>20056528</parentid>
      <timestamp>2013-04-06T01:14:29Z</timestamp>
      <text xml:space="preserve">
        ...
      </text>
    </revision>
  </page>
</mediawiki>
Here is what I've tried, based on this answer to 'Clojure XML Parsing':
(ns example.core
  (:use [clojure.data.zip.xml :only (attr text xml->)])
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]))

(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)
        zipped (zip/xml-zip xml)]
    (xml-> zipped :page :title text)))
(count (titles data-9764))
; 38
(count (titles data-99224))
; 779
(count (titles data-995066))
; 5172
(count (titles data-7999931))
; OutOfMemoryError Java heap space java.util.Arrays.copyOfRange (Arrays.java:3209)
Am I doing something wrong in my code? Or is this perhaps a bug or limitation in the libraries I'm using? Based on REPL experimentation, it seems like the code I'm using is lazy. Underneath, Clojure uses a SAX XML parser, so that alone should not be the problem.
Update 2013-04-30:
I'd like to share some discussion from the clojure IRC channel. I've pasted an edited version below. (I removed the user names, but if you want credit, just let me know; I'll edit and give you a link.)
The entire tag is read into memory at once in xml/parse, long before you even call count. And clojure.xml uses the ~lazy SAX parser to produce an eager concrete collection. Processing XML lazily requires a lot more work than you think - and it would be work you do, not some magic clojure.xml could do for you. Feel free to disprove by calling (count (xml/parse data-whatever)).
To summarize, even before zip/xml-zip is used, this call to xml/parse causes an OutOfMemoryError with a large enough file:
(count (xml/parse filename))
At present, I am exploring other XML processing options. At the top of my list is clojure.data.xml as mentioned at https://stackoverflow.com/a/9946054/109618.
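As a rough sketch of what the streaming version could look like — lazy-titles is a hypothetical helper, and it assumes clojure.data.xml is on the classpath and that its parse produces a lazily-constructed element tree of {:tag :attrs :content} elements, as documented:

```clojure
(require '[clojure.data.xml :as data-xml]
         '[clojure.java.io :as io])

(defn lazy-titles
  "Hypothetical helper: stream the <title> text of each <page>
  out of a mediawiki dump via clojure.data.xml's lazy parsing."
  [filename]
  (with-open [rdr (io/reader filename)]
    (doall                        ; realize the titles before the reader closes
     (for [page (:content (data-xml/parse rdr))
           :when (= :page (:tag page))      ; skip whitespace text nodes
           el    (:content page)
           :when (= :title (:tag el))]
       (apply str (:content el))))))
```

Only the small title strings are retained by the doall; assuming nothing else holds the head of the content seq, the bulky <text> bodies of already-consumed pages can be garbage-collected as parsing proceeds.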
It's a limitation of the zipper data structure. Zippers are designed for efficiently navigating trees of various sorts, with support for moving up/down/left/right in the tree hierarchy, with in-place edits in near-constant time.
From any position in the tree, the zipper needs to be able to re-construct the original tree (with edits applied). To do that, it keeps track of the current node, the parent node, and all siblings to the left and right of the current node in the tree, making heavy use of persistent data structures.
The filter functions that you're using start at the left-most child of a node and work their way one-by-one to the right, testing predicates along the way. The zipper for the left-most child starts out with an empty vector for its left-hand siblings (note the :l [] part in the source for zip/down). Each time you move right, it will add the last node visited to the vector of left-hand siblings (:l (conj l node) in zip/right). By the time you arrive at the right-most child, you've built up an in-memory vector of all the nodes in that level of the tree, which, for a wide tree like yours, could cause an OOM error.
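That accumulation is easy to see on a toy zipper. The wide [<page> <page> ...] level of the dump behaves like this flat vector; peeking at the loc's second element (its path map) relies on an implementation detail of clojure.zip and is shown only to make the sibling vector visible:

```clojure
(require '[clojure.zip :as zip])

;; Descend to the left-most child: its :l vector starts out empty.
(def loc (zip/down (zip/vector-zip [:a :b :c :d])))  ; at :a

;; Each zip/right conj's the node just passed onto :l ...
(def loc3 (-> loc zip/right zip/right))              ; at :c

;; ... so the path map now holds every left-hand sibling in memory.
(:l (second loc3)) ; => [:a :b]
```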
As a workaround, if you know that the top-level element is just a container for a list of <page> elements, I'd suggest using the zipper to navigate within the page elements and just use map to process the pages:
(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)]
    (map #(xml-> (zip/xml-zip %) :title text)
         (:content xml))))
So, basically, we're avoiding the zip abstraction at the top level of the overall XML input, and thus avoid its holding the entire document in memory. This implies that for even larger XML, where each first-level child is itself huge, we might have to skip the zipper again at the second level of the XML structure, and so on...
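Pushing that idea as far as it goes, here is a hypothetical titles-no-zip that never creates a zipper at all, walking the plain {:tag :attrs :content} maps that clojure.xml/parse returns (the :when guard also skips any whitespace text nodes, whose :tag lookup is nil):

```clojure
(defn titles-no-zip
  "Hypothetical sketch: extract titles without any zipper."
  [filename]
  (for [page (:content (xml/parse filename))
        el   (:content page)
        :when (= :title (:tag el))]
    (apply str (:content el))))
```

This still eagerly parses the whole file (the limitation of xml/parse discussed above), but it avoids the zipper's per-level sibling vectors entirely.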