OutOfMemoryError when parsing XML in Clojure with data.zip


Question

I want to use Clojure to extract the titles from a Wiktionary XML dump.

I used head -n10000 > out-10000.xml to create smaller versions of the original monster file. Then I trimmed with a text editor to make it valid XML. I renamed the files according to the number of lines inside (wc -l):

(def data-9764 "data/wiktionary-en-9764.xml") ; 354K
(def data-99224 "data/wiktionary-en-99224.xml") ; 4.1M
(def data-995066 "data/wiktionary-en-995066.xml") ; 34M
(def data-7999931 "data/wiktionary-en-7999931.xml") ; 222M
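The truncation step described above can be sketched as follows. The dump and output names here are toy stand-ins, and the `echo` restores the root element that `head` cuts off (in the question this trimming was done by hand in an editor):

```shell
# build a toy dump so the commands below are runnable as-is
printf '<mediawiki>\n<page><title>a</title></page>\n<page><title>b</title></page>\n' > dump.xml

# take the first N lines of the big file (the real case used head -n 10000)
head -n 2 dump.xml > out-2.xml

# close the root element so the truncated file is well-formed XML again
echo '</mediawiki>' >> out-2.xml

# the files were then renamed after their line counts
wc -l out-2.xml
```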

Here is the overview of the XML structure:

<mediawiki>
  <page>
    <title>dictionary</title>
    <revision>
      <id>20100608</id>
      <parentid>20056528</parentid>
      <timestamp>2013-04-06T01:14:29Z</timestamp>
      <text xml:space="preserve">
        ...
      </text>
    </revision>
  </page>
</mediawiki>

Here is what I've tried, based on this answer to 'Clojure XML Parsing':

(ns example.core
  (:use [clojure.data.zip.xml :only (attr text xml->)])
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]))

(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)
        zipped (zip/xml-zip xml)]
    (xml-> zipped :page :title text)))

(count (titles data-9764))
; 38

(count (titles data-99224))
; 779

(count (titles data-995066))
; 5172

(count (titles data-7999931))
; OutOfMemoryError Java heap space  java.util.Arrays.copyOfRange (Arrays.java:3209)

Am I doing something wrong in my code? Or is this perhaps a bug or limitation in the libraries I'm using? Based on REPL experimentation, it seems like the code I'm using is lazy. Underneath, Clojure uses a SAX XML parser, so that alone should not be the problem.


Update 2013-04-30:

I'd like to share some discussion from the clojure IRC channel. I've pasted an edited version below. (I removed the user names, but if you want credit, just let me know; I'll edit and give you a link.)

The entire tag is read into memory at once in xml/parse, long before you even call count. And clojure.xml uses the ~lazy SAX parser to produce an eager concrete collection. Processing XML lazily requires a lot more work than you think - and it would be work you do, not some magic clojure.xml could do for you. Feel free to disprove by calling (count (xml/parse data-whatever)).

To summarize, even before using zip/xml-zip, this xml/parse causes an OutOfMemoryError with a large enough file:

(count (xml/parse filename))
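Until a lazy parser is wired in, the only stopgap is giving the JVM a bigger heap, which delays rather than removes the error. A hypothetical Leiningen fragment (assuming Leiningen is the build tool here; project name and versions are illustrative):

```clojure
;; project.clj (hypothetical fragment)
(defproject wiktionary-titles "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.zip "0.1.1"]]
  ;; a 222M XML file expands into a far larger in-memory tree of
  ;; maps and vectors, so give the eager parse room to finish
  :jvm-opts ["-Xmx4g"])
```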

At present, I am exploring other XML processing options. At the top of my list is clojure.data.xml as mentioned at https://stackoverflow.com/a/9946054/109618.
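A lazy variant along those lines might look like the sketch below, assuming clojure.data.xml is on the classpath. Its `parse` returns a lazily constructed tree, so pages can be consumed one at a time; note that the tag keywords here assume a dump without XML-namespace prefixing, and the `doall` is needed because the laziness must not escape the `with-open`:

```clojure
(require '[clojure.data.xml :as dx]
         '[clojure.java.io :as io])

(defn lazy-titles
  "Extract page titles without realizing the whole tree up front."
  [filename]
  (with-open [rdr (io/reader filename)]
    (doall
     (for [page (:content (dx/parse rdr))   ; lazy seq of children
           :when (= :page (:tag page))
           el   (:content page)
           :when (= :title (:tag el))]
       (first (:content el))))))
```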

Solution

It's a limitation of the zipper data structure. Zippers are designed for efficiently navigating trees of various sorts, with support for moving up/down/left/right in the tree hierarchy, with in-place edits in near-constant time.

From any position in the tree, the zipper needs to be able to re-construct the original tree (with edits applied). To do that, it keeps track of the current node, the parent node, and all siblings to the left and right of the current node in the tree, making heavy use of persistent data structures.

The filter functions that you're using start at the left-most child of a node and work their way one-by-one to the right, testing predicates along the way. The zipper for the left-most child starts out with an empty vector for its left-hand siblings (note the :l [] part in the source for zip/down). Each time you move right, it will add the last node visited to the vector of left-hand siblings (:l (conj l node) in zip/right). By the time you arrive at the right-most child, you've built up an in-memory vector of all the nodes in that level in the tree, which, for a wide tree like yours, could cause an OOM error.
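The sibling bookkeeping described above can be observed on a tiny tree with core `clojure.zip` alone; `zip/lefts` exposes the accumulated left-hand siblings at each step:

```clojure
(require '[clojure.zip :as zip])

(def tiny (zip/vector-zip [:a :b :c :d]))

;; at the leftmost child nothing has been accumulated yet
(-> tiny zip/down zip/lefts)                     ; => nil

;; each zip/right conj's the node just left behind, so after two
;; moves the left-sibling vector already holds :a and :b
(-> tiny zip/down zip/right zip/right zip/lefts) ; => (:a :b)
```

With millions of `<page>` siblings under `<mediawiki>`, that vector grows to hold nearly the whole top level of the tree.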

As a workaround, if you know that the top-level element is just a container for a list of <page> elements, I'd suggest using the zipper to navigate within the page elements and just use map to process the pages:

(defn titles
  "Extract titles from +filename+"
  [filename]
  (let [xml (xml/parse filename)]
    (map #(xml-> (zip/xml-zip %) :title text)
         (:content xml))))

So, basically, we avoid using the zip abstraction at the top level of the overall XML input, and thus avoid the zipper's sibling bookkeeping holding that entire level in memory. This implies that for even larger XML, where each first-level child is itself huge, we may have to skip the zipper again at the second level of the XML structure, and so on...
