How to parse invalid (bad / not well-formed) XML?


Question

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against actual customer data, and it looks like the other product accepts user input that should be considered invalid. Anyway, I still have to figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder, and I'm getting an error on input that looks like the following.

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...).

I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

Recommended answer

That "XML" is worse than invalid; it's not well-formed (see Well Formed vs Valid XML).

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
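
To make the failure concrete in the asker's setting, here is a minimal sketch that feeds the sample input to javax.xml.parsers.DocumentBuilder. The class name WellFormednessCheck and the helper isWellFormed are hypothetical, introduced only for illustration; the second test string is an assumption about what the provider would send if it escaped the description text correctly.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is not from the question.
final class WellFormednessCheck {

    /** Returns true only if the text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations are fatal errors, reported as
            // SAXParseException (a subclass of SAXException).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The unescaped "tag" opens an element that is never closed.
        String bad = "<xml><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        // Assumed corrected form: the same text with the markup characters escaped.
        String good = "<xml><description>Example:Description:"
                + "&lt;THIS-IS-PART-OF-DESCRIPTION&gt;</description></xml>";
        System.out.println(isWellFormed(bad));  // prints false
        System.out.println(isWellFormed(good)); // prints true
    }
}
```

The parser stops at the first well-formedness error, so there is no partial DOM to salvage; any repair has to happen before the text reaches DocumentBuilder.parse(...).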

  1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically, the phrase "well-formed XML" is redundant, but it may be useful for emphasis.)

  2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:

  • Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):

xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null

  • Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.

    Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.

    Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.

    .NET:

    • XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
    • @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
    • @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
    • Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".

    PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.

    Ruby: Nokogiri supports "Gentle Well-Formedness".

    R: See htmlTreeParse() for fault-tolerant markup parsing in R.

    Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
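
For the Java route, the FilterInputStream preprocessing mentioned above might look like the following minimal sketch. The class XmlControlCharFilter is hypothetical, not part of any library. It assumes the input is UTF-8 (or another ASCII-compatible encoding), where a byte below 0x20 can only ever be a control character, and it handles only that one class of illegal character; it does nothing about stray markup such as the question's unescaped tag.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical filter: drops bytes that encode control characters illegal in XML 1.0. */
final class XmlControlCharFilter extends FilterInputStream {

    XmlControlCharFilter(InputStream in) {
        super(in);
    }

    // XML 1.0 allows only tab, line feed, and carriage return below U+0020.
    // Assumption: UTF-8 input, so these byte values never occur inside a
    // multi-byte sequence. The end-of-stream marker (-1) is never "illegal".
    private static boolean isIllegal(int b) {
        return b >= 0 && b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (isIllegal(b));
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int kept;
        do {
            int n = super.read(buf, off, len);
            if (n < 0) {
                return n; // end of stream
            }
            // Compact the legal bytes to the front of the region just read.
            kept = 0;
            for (int i = off; i < off + n; i++) {
                if (!isIllegal(buf[i] & 0xFF)) {
                    buf[off + kept++] = buf[i];
                }
            }
        } while (kept == 0); // read again rather than return 0 for a non-empty request
        return kept;
    }
}
```

The filter would wrap the raw stream before parsing, for example builder.parse(new XmlControlCharFilter(rawStream)), where builder is a DocumentBuilder and rawStream is whatever InputStream delivers the provider's data.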

  3. Process the data as text, either manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, because what appears to be predictable often is not: rule breaking is rarely bound by rules.

    • For invalid character errors, use regex to remove/replace invalid characters:

    • PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
    • Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
    • JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')

    • For ampersands, use regex to replace matches with &amp; (credit: blhsin, demo):

    &(?!(?:#\d+|#x[0-9a-f]+|\w+);)

    • Note that the above regular expressions won't take comments or CDATA sections into account.
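
Since the question is about Java, here is a minimal sketch of the same two text-level cleanups using java.util.regex. The class and method names are hypothetical. Two details differ from the patterns listed above and are assumptions on my part: the character class also keeps the supplementary range U+10000 to U+10FFFF, which the XML 1.0 Char production allows, and the hexadecimal branch of the ampersand pattern accepts uppercase A-F explicitly instead of relying on a case-insensitive flag. The same caveat applies: comments and CDATA sections are not taken into account.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, shown only to illustrate the regex approach.
final class XmlTextCleanup {

    // Matches any code point outside the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    // An ampersand that does not start something shaped like a character or
    // entity reference (&#38; / &#x26; / &name;). Note that &name; is left
    // alone even when "name" is not an entity the parser knows about.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    /** Removes characters that may not appear anywhere in an XML 1.0 document. */
    static String stripInvalidChars(String text) {
        return INVALID_XML_CHARS.matcher(text).replaceAll("");
    }

    /** Escapes ampersands that are not part of a reference. */
    static String escapeBareAmpersands(String text) {
        return BARE_AMPERSAND.matcher(text).replaceAll("&amp;");
    }

    /** Applies both cleanups; the result still has to be parsed to confirm it is well-formed. */
    static String clean(String text) {
        return escapeBareAmpersands(stripInvalidChars(text));
    }
}
```

Neither cleanup repairs the unescaped tag from the question. That case needs its own rule, which is exactly where text-level processing starts to get tricky.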
