How to tidy up malformed XML in Ruby

Question

I'm having trouble tidying up the malformed XML that I get back from the SEC's EDGAR database.

For some reason they have horribly formed XML. Tags that contain any sort of string aren't closed, and a document can actually contain other XML or HTML documents inside other tags. Normally I'd hand this off to Tidy, but that isn't being maintained.

I've tried using Nokogiri::XML::SAX::Parser, but it seems to choke because the tags aren't closed. It works all right until it hits the first ending tag, and then it doesn't fire on any more of them. It is still spitting out the right characters, though.

  class Filing < Nokogiri::XML::SAX::Document
    def start_element name, attrs = []
      puts "starting: #{name}"
    end

    def characters str
      puts "chars: #{str}"
    end

    def end_element name
      puts "ending: #{name}"
    end
  end

SAX seems like the best option because I can simply have it ignore the embedded XML or HTML documents. It also makes the most sense because some of these documents can get quite large, so holding the whole DOM in memory would probably not work.

Here are some example files: 1 2 3

I'm starting to think I'll just have to write my own custom parser.

Recommended answer

Nokogiri's normal DOM mode can automatically fix up the XML so that it is syntactically correct, or a reasonable facsimile of that. It sometimes gets confused and shifts closing tags around, but you can preprocess the file to give it a nudge in the right direction if need be.
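As a rough illustration of that kind of preprocessing, here is a minimal sketch. It assumes the problem lines look like `<TAG>value` with the closing tag simply missing, which is a guess based on the question's description and not a verified EDGAR rule. The method name `close_dangling_tags`, the constant `UNCLOSED_TAG_LINE`, and the sample text are all hypothetical.

```ruby
# Matches a line of the form "<TAG>some text" that has no further markup on it.
# Lines that are only an opening tag, closing tags, and already-closed
# elements do not match and are left alone.
UNCLOSED_TAG_LINE = /\A(\s*)<([A-Za-z][\w.-]*)>([^<\n]+?)\s*\z/

# Appends the missing closing tag to every line matched above.
# Assumption: one element per line, as in the hypothetical sample below.
def close_dangling_tags(text)
  text.each_line.map do |line|
    line.chomp.sub(UNCLOSED_TAG_LINE, '\1<\2>\3</\2>')
  end.join("\n")
end

# Hypothetical sample input, not taken from a real filing.
sample = "<DOC>\n<TYPE>10-K\n<NAME>Example Corp\n</DOC>\n"
puts close_dangling_tags(sample)
```

The sketch does not escape characters such as `&` inside the values, so the result may still need Nokogiri's recovery pass afterward.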

I saved XML #1 out to a file and loaded it:

require 'nokogiri'

# File.open with a block returns the block's value and closes the file afterward.
doc = File.open('./test.xml') { |file| Nokogiri::XML(file) }

puts doc.to_xml

After parsing, you can check the Nokogiri::XML::Document instance's errors method to see what errors were generated, for perverse pleasure.

doc.errors

If using Nokogiri's DOM model isn't good enough, have you considered using xmllint to preprocess and clean the data, emitting clean XML so that the SAX parser will work? Its --recover option might be of use.

xmllint --recover test.xml

It will output errors on stderr and the recovered XML on stdout, so you can easily redirect the result to another file.
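If you would rather drive xmllint from Ruby than from the shell, a sketch along these lines could work. It assumes xmllint is installed and on your PATH; the helper names `xmllint_recover_command` and `recover_with_xmllint` are hypothetical, and only the `--recover` flag shown above is used.

```ruby
require 'open3'

# Builds the argument list for the recovery run (same flag as the command above).
def xmllint_recover_command(path)
  ['xmllint', '--recover', path]
end

# Runs xmllint and returns [recovered_xml, error_text].
# stdout carries the recovered XML and stderr carries the error messages.
# The exit status is ignored in this sketch; check it if you need to
# tell a repaired document apart from a failed run.
def recover_with_xmllint(path)
  recovered_xml, error_text, _status = Open3.capture3(*xmllint_recover_command(path))
  [recovered_xml, error_text]
end
```

A call such as `xml, errors = recover_with_xmllint('./test.xml')` would then give you a string that can be handed to a SAX parser, for example `Nokogiri::XML::SAX::Parser.new(Filing.new).parse(xml)`.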

As for writing your own parser... why? You have other options available to you, and reinventing a nicely implemented wheel is not a good use of time.
