Processing large XML file with libxml-ruby chunk by chunk


Question


I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing the file record by record with constant memory usage. Below is my main loop:

    require 'xml'  # libxml-ruby

    File.open('dblp.xml') do |io|
      dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
      pubFactory = PubFactory.new

      i = 0
      while dblp.read do
        case dblp.name
        when 'article', 'inproceedings', 'book'
          pub = pubFactory.create(dblp.expand)
          i += 1
          puts pub
          pub = nil
          $stderr.puts i if i % 10000 == 0
          dblp.next
        when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
          # ignore for now
          dblp.next
        else
          # nothing
        end
      end
    end


The key here is that dblp.expand reads an entire subtree (like an <article> record) and passes it as an argument to a factory for further processing. Is this the right approach?
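
For context, here is a hypothetical sketch of what such a factory might look like (PubFactory is named in the question but its body is not shown; the Pub value object and the title/year fields are assumptions based on the DBLP schema, and first is the helper defined below):

    # Hypothetical sketch of the factory used above. Only 'pages' appears
    # in the question; 'title' and 'year' are assumed DBLP field names.
    class PubFactory
      def create(node)
        pub = Pub.new                     # assumed value object, not shown in the question
        pub.title = first(node, 'title')
        pub.pages = first(node, 'pages')
        pub.year  = first(node, 'year')
        pub                               # holds plain strings, so the expanded node can be discarded
      end
    end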


Within the factory method I then use high-level, XPath-like expressions to extract the content of elements, as below. Again, is this viable?

    def first(root, node)
      x = root.find(node).first
      x ? x.content : nil
    end

    pub.pages = first(node, 'pages') # node contains expanded node from dblp.expand

Answer

When processing big XML files, you should use a stream parser to avoid loading everything into memory. There are two common approaches:

  • Push parsers, like SAX, where you react to encountered tags as you get them (see tadman's answer).
  • Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up/go down, etc.

I think that push parsers are nice to use if you want to retrieve just some fields, but they are generally messy to use for complex data extraction and are often implemented with huge case... when... constructs. A rough sketch of the push style follows.
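
As an illustration, here is a minimal sketch of the push style using libxml-ruby's SAX interface (assuming the standard XML::SaxParser::Callbacks hooks; only the <pages> field of <article> records is handled, and error handling is omitted):

    require 'xml'  # libxml-ruby

    # Prints the <pages> content of every <article> record.
    class DblpCallbacks
      include XML::SaxParser::Callbacks

      def on_start_element(element, attributes)
        @in_article = true if element == 'article'
        @in_pages = @in_article && element == 'pages'
      end

      def on_characters(chars)
        (@pages ||= '') << chars if @in_pages
      end

      def on_end_element(element)
        case element
        when 'pages'
          @in_pages = false
        when 'article'
          puts @pages
          @pages = nil
          @in_article = false
        end
      end
    end

    parser = XML::SaxParser.file('dblp.xml')
    parser.callbacks = DblpCallbacks.new
    parser.parse

Note how even this tiny extraction task needs state flags and a case... when... dispatch; that bookkeeping is what grows unwieldy as the number of extracted fields increases.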

Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article about pull parsers with REXML in Dr. Dobb's Journal.
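
For comparison, a minimal sketch of the pull style using REXML's pull parser from Ruby's standard library (assuming the stock REXML::Parsers::PullParser API; again, only <pages> extraction is shown):

    require 'rexml/parsers/pullparser'

    parser = REXML::Parsers::PullParser.new(File.new('dblp.xml'))
    in_pages = false

    while parser.has_next?
      event = parser.pull
      if event.start_element? && event[0] == 'pages'
        in_pages = true
      elsif event.text? && in_pages
        puts event[0]      # raw text content of <pages>
      elsif event.end_element? && event[0] == 'pages'
        in_pages = false
      end
    end

Here only the current event is held in memory, and you advance the cursor explicitly with pull, which is what makes constant-memory, record-by-record processing straightforward.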

