Which Ruby XML library would you recommend for a 2.4MB XML file?


Question

I have a 2.4 MB XML file, an export from Microsoft Project (hey, I'm the victim here!), from which I have been asked to extract certain details for re-presentation. Ignoring the intelligence or otherwise of the request, which library should I try first from a Ruby perspective?

I'm aware of the following (in no particular order):

I'd prefer something packaged as a Ruby gem, which I suspect the Chilkat library is not.

Performance isn't a major issue: I don't expect the thing to need to run more than once a day (once a week is more likely). I'm more interested in something that's as easy to use as anything XML-related can be.

I tried the ones that are available as gems:

hpricot is, by a country mile, the easiest. For example, to extract the content of the SaveVersion tag in this XML (saved in a file called, say, 'test.xml')

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Project xmlns="http://schemas.microsoft.com/project">
    <SaveVersion>12</SaveVersion>
</Project>

requires something like this:

require 'hpricot'

# Parse the file as XML, then walk Project -> SaveVersion and read the text.
doc = Hpricot.XML(open('test.xml'))
version = (doc/:Project/:SaveVersion).first.inner_html

hpricot seems to be relatively unconcerned with namespaces, which is fine in this example, where there is only one, but could be a problem with a complex document. Since hpricot is also very slow, I rather imagine this is a problem that would solve itself.
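
To illustrate what a namespace-aware lookup of the same tag involves, here is a minimal sketch. It uses REXML, which ships with Ruby, rather than hpricot, so that it runs without extra gems; the prefix 'p', the constant PROJECT_NS and the method name save_version are made up for this example and are not part of any library.

```ruby
require 'rexml/document'

# Namespace URI taken from the sample document above.
PROJECT_NS = 'http://schemas.microsoft.com/project'

# Returns the SaveVersion value as an Integer, or nil if the tag is missing.
# The prefix 'p' is arbitrary: it only has to match the key in the
# namespace mapping passed to REXML::XPath.first.
def save_version(xml)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, '/p:Project/p:SaveVersion', 'p' => PROJECT_NS)
  node && node.text.to_i
end
```

Calling save_version(File.read('test.xml')) on the sample file would look the tag up by namespace URI, so it would keep working if the document later mixed several namespaces.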

libxml-ruby is an order of magnitude faster, understands namespaces (it took me a good couple of hours to figure this out) and is altogether much closer to the XML metal: XPath queries and all the other stuff are in there. This is not necessarily a Good Thing if, like me, you open up an XML document only under conditions of extreme duress. The helper module was mostly helpful in providing examples of how to handle a default namespace effectively. This is roughly what I ended up with (I'm not in any way asserting its beauty, correctness or other value, it's just where I am right now):

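# Assumes libxml-ruby is already loaded (for example via require 'xml') and
# that `path` names the exported XML file.

# Prefixes every tag in a path with the namespace prefix, so that
# "Project/SaveVersion" becomes "p:Project/p:SaveVersion".
def xpath_qry(tags, scope = :in_node)
  "#{@scopes[scope]}" + tags.split(/\//).collect{ |tag| "#{@ns_prefix}:#{tag}"}.join('/')
end

xml_parser = XML::Parser.new
xml_parser.string = File.read(path)
doc = xml_parser.parse
@root = doc.root
@scopes = { :in_node => '', :in_root => '/', :in_doc => '//' }
@ns_prefix = 'p'
# Bind the arbitrary prefix to the document's default namespace ("prefix:uri").
@ns = "#{@ns_prefix}:#{@root.namespace[0].href}"
version = @root.find_first(xpath_qry("Project/SaveVersion", :in_root), @ns).content.to_i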
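
The query-building part of that snippet is plain string manipulation, so it can be tried on its own without libxml-ruby installed. The sketch below is a hypothetical standalone version: the names SCOPES, prefixed_xpath and namespace_binding are invented here, and the keyword arguments are my own choice rather than anything from the original code.

```ruby
# Maps a scope name to the XPath fragment that starts the query.
SCOPES = { in_node: '', in_root: '/', in_doc: '//' }.freeze

# Prefixes each tag in a slash-separated path with a namespace prefix.
def prefixed_xpath(tags, scope: :in_node, prefix: 'p')
  SCOPES.fetch(scope) + tags.split('/').map { |tag| "#{prefix}:#{tag}" }.join('/')
end

# Builds the "prefix:uri" string that the snippet above passes as the
# namespace argument to find_first.
def namespace_binding(href, prefix: 'p')
  "#{prefix}:#{href}"
end

puts prefixed_xpath('Project/SaveVersion', scope: :in_root)
puts namespace_binding('http://schemas.microsoft.com/project')
```

Keeping the prefix in one place means the queries read like unprefixed paths, which is roughly the convenience hpricot gives for free.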

I'm still debating the pros and cons: libxml for its extra rigour, hpricot for the sheer style of _why's code.

EDIT again, somewhat later: I discovered HappyMapper ('gem install happymapper'), which is hugely promising, if still at an early stage. It's declarative and mostly works, although I have spotted a couple of edge cases that I don't have fixes for yet. It lets you do stuff like this, which parses my Google Reader OPML:

require 'happymapper'

module OPML
  class Outline
    include HappyMapper
    tag 'outline'
    attribute :title, String
    attribute :text, String
    attribute :type, String
    attribute :xmlUrl, String
    attribute :htmlUrl, String
    has_many :outlines, Outline
  end
end

xml_string = File.read("google-reader-subscriptions.xml")

sections = OPML::Outline.parse(xml_string)

I already love it, even though it's not perfect yet.

Recommended answer

Hpricot is probably the best tool for you: it is easy to use and should handle a 2 MB file with no problem.

Speed-wise, libxml should be the best. I used the libxml2 binding for Python a few months ago (at that time rb-libxml was stale). The streaming interface worked best for me (LibXML::XML::Reader in the Ruby gem). It lets you process a file while it is still downloading, is a bit more user-friendly than SAX, and allowed me to load data from a 30 MB XML file from the internet into a MySQL database in a little more than a minute.
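
For a feel of what streaming processing looks like, here is a rough sketch. It does not use LibXML::XML::Reader; it swaps in REXML's stream listener from Ruby's standard distribution so that it runs without extra gems, and the callback style is closer to SAX than to a reader. The class NameCollector, the method task_names and the assumption that the interesting data lives in Name elements are all invented for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Collects the text of every <Name> element while the parser streams
# through the input, without building a full document tree.
class NameCollector
  include REXML::StreamListener

  attr_reader :names

  def initialize
    @names = []
    @buffer = +''
    @in_name = false
  end

  # Called when an opening tag is read.
  def tag_start(name, _attrs)
    return unless name == 'Name'
    @in_name = true
    @buffer = +''
  end

  # Called for character data; may fire more than once per element.
  def text(data)
    @buffer << data if @in_name
  end

  # Called when a closing tag is read.
  def tag_end(name)
    return unless name == 'Name'
    @names << @buffer.strip
    @in_name = false
  end
end

# Accepts a String or an IO object, so an open file can be passed directly.
def task_names(source)
  listener = NameCollector.new
  REXML::Document.parse_stream(source, listener)
  listener.names
end
```

Passing an open File instead of a String, for example File.open('test.xml') { |f| task_names(f) }, avoids reading the whole document into memory first, which is the main attraction of the streaming style for larger exports.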
