Processing large XML file with libxml-ruby chunk by chunk


Question


I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing the file record by record with constant memory usage. Below is my main loop:

    require 'xml'  # libxml-ruby

    File.open('dblp.xml') do |io|
      dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
      pubFactory = PubFactory.new

      i = 0
      while dblp.read do
        case dblp.name
        when 'article', 'inproceedings', 'book'
          pub = pubFactory.create(dblp.expand)
          i += 1
          puts pub
          pub = nil
          $stderr.puts i if i % 10000 == 0
          dblp.next
        when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
          # ignore for now
          dblp.next
        else
          # nothing
        end
      end
    end


The key here is that dblp.expand reads an entire subtree (like an <article> record) and passes it as an argument to a factory for further processing. Is this the right approach?
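
For context, here is a hypothetical sketch of what such a factory might look like (PubFactory is named in the question but its body is not shown; the Pub value object and the title/year fields are assumptions based on the DBLP schema, and first is the helper defined below):

    # Hypothetical sketch of the factory used above. Only 'pages' appears
    # in the question; 'title' and 'year' are assumed DBLP field names.
    class PubFactory
      def create(node)
        pub = Pub.new                     # assumed value object, not shown in the question
        pub.title = first(node, 'title')
        pub.pages = first(node, 'pages')
        pub.year  = first(node, 'year')
        pub                               # holds plain strings, so the expanded node can be discarded
      end
    end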


Within the factory method I then use high-level, XPath-like expressions to extract the content of elements, as below. Again, is this viable?

    def first(root, node)
      x = root.find(node).first
      x ? x.content : nil
    end

    pub.pages = first(node, 'pages') # node contains expanded node from dblp.expand

Answer

When processing big XML files, you should use a stream parser to avoid loading everything into memory. There are two common approaches:

  • Push parsers, like SAX, where you react to encountered tags as you get them (see tadman's answer).
  • Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up/go down, etc.

I think that push parsers are nice to use if you want to retrieve just some fields, but they are generally messy to use for complex data extraction and are often implemented with huge case... when... constructs. A rough sketch of the push style follows.
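
As an illustration, here is a minimal sketch of the push style using libxml-ruby's SAX interface (assuming the standard XML::SaxParser::Callbacks hooks; only the <pages> field of <article> records is handled, and error handling is omitted):

    require 'xml'  # libxml-ruby

    # Prints the <pages> content of every <article> record.
    class DblpCallbacks
      include XML::SaxParser::Callbacks

      def on_start_element(element, attributes)
        @in_article = true if element == 'article'
        @in_pages = @in_article && element == 'pages'
      end

      def on_characters(chars)
        (@pages ||= '') << chars if @in_pages
      end

      def on_end_element(element)
        case element
        when 'pages'
          @in_pages = false
        when 'article'
          puts @pages
          @pages = nil
          @in_article = false
        end
      end
    end

    parser = XML::SaxParser.file('dblp.xml')
    parser.callbacks = DblpCallbacks.new
    parser.parse

Note how even this tiny extraction task needs state flags and a case... when... dispatch; that bookkeeping is what grows unwieldy as the number of extracted fields increases.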

Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article about pull parsers with REXML in Dr. Dobb's Journal.
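
For comparison, a minimal sketch of the pull style using REXML's pull parser from Ruby's standard library (assuming the stock REXML::Parsers::PullParser API; again, only <pages> extraction is shown):

    require 'rexml/parsers/pullparser'

    parser = REXML::Parsers::PullParser.new(File.new('dblp.xml'))
    in_pages = false

    while parser.has_next?
      event = parser.pull
      if event.start_element? && event[0] == 'pages'
        in_pages = true
      elsif event.text? && in_pages
        puts event[0]      # raw text content of <pages>
      elsif event.end_element? && event[0] == 'pages'
        in_pages = false
      end
    end

Here only the current event is held in memory, and you advance the cursor explicitly with pull, which is what makes constant-memory, record-by-record processing straightforward.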

