How to parse invalid (bad / not well-formed) XML?


Question

Currently, I'm working on a feature that involves parsing XML that we receive from another product. I ran some tests against actual customer data, and it looks like the other product accepts user input that should be considered invalid. Regardless, I still have to figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder, and I'm getting an error on input that looks like the following.

<xml>
  ...
  <description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
  ...
</xml>

As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). This description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...).

I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?

Recommended answer

That "XML" is worse than invalid: it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
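To make the failure concrete, here is a minimal Java sketch that feeds the question's sample to the standard JAXP DocumentBuilder. The class name WellFormednessCheck and the helper firstError are made up for illustration; the exact error message depends on the parser implementation and locale, so none is shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
final class WellFormednessCheck {

    /** Returns null if the text parses as XML, otherwise a description of the first fatal error. */
    static String firstError(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Swap out the default error handler so fatal errors are not echoed to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return null;
        } catch (SAXParseException e) {
            // Well-formedness violations surface here.
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what this sketch is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "<xml><description>Example:Description:"
            + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        System.out.println(firstError(bad));
    }
}
```

The parser reads <THIS-IS-PART-OF-DESCRIPTION> as a start tag, so the following </description> no longer matches, and a conformant parser has to stop with a fatal error.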


  1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically, the phrase "well-formed XML" is redundant, but it may be useful for emphasis.)
  2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:

  • Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):

xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null


  • Standalone and C: HTML Tidy works with XML too.

  • .NET:

    • XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.

    • @jdweng reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.

  3. Process the data as text, either manually in a text editor or programmatically with character/string functions. Doing this programmatically can range from tricky to impossible, because what appears to be predictable often is not: rule breaking is rarely bound by rules.


    • For invalid character errors, use regex to remove/replace invalid characters:
      • PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
      • Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
      • JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')

    • For ampersands, use regex to replace matches with &amp; (credit: blhsin, demo):

        &(?!(?:#\d+|#x[0-9a-f]+|\w+);)
        


    Note that the above regular expressions won't take comments or CDATA sections into account.
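As a rough illustration of the text-processing approach applied to the question's case, the sketch below ports the character and ampersand patterns to Java and adds a text-level escape for a known leaf element. All names (XmlTextRepair, stripInvalidChars, escapeBareAmpersands, escapeLeafContent, firstLeafText) are hypothetical. It assumes the leaf element has no attributes, never legitimately contains markup, comments, or CDATA, and that its content never includes its own closing tag; real data may well break those assumptions. Two deliberate deviations from the patterns above: the character class also keeps U+10000 to U+10FFFF, which XML 1.0 permits, and the hex digits in the ampersand pattern accept upper case.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class XmlTextRepair {
    // Matches runs of characters outside the XML 1.0 Char production.
    private static final Pattern INVALID_CHARS = Pattern.compile(
        "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]+");

    // Matches an '&' that does not start a character or entity reference.
    private static final Pattern BARE_AMPERSAND =
        Pattern.compile("&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    /** Deletes characters that XML 1.0 does not allow. */
    static String stripInvalidChars(String text) {
        return INVALID_CHARS.matcher(text).replaceAll("");
    }

    /** Rewrites stray '&' as "&amp;", leaving existing references alone. */
    static String escapeBareAmpersands(String text) {
        return BARE_AMPERSAND.matcher(text).replaceAll("&amp;");
    }

    /** Escapes '&', '<' and '>' found between the start and end tags of leafName. */
    static String escapeLeafContent(String text, String leafName) {
        String name = Pattern.quote(leafName);
        // Assumption: the start tag carries no attributes.
        Pattern leaf = Pattern.compile("(<" + name + ">)(.*?)(</" + name + ">)", Pattern.DOTALL);
        Matcher m = leaf.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = escapeBareAmpersands(m.group(2))
                .replace("<", "&lt;")
                .replace(">", "&gt;");
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + body + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Repairs the text, parses it, and returns the text of the first leafName element. */
    static String firstLeafText(String text, String leafName) {
        String repaired = escapeLeafContent(stripInvalidChars(text), leafName);
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(repaired)));
            return doc.getElementsByTagName(leafName).item(0).getTextContent();
        } catch (Exception e) {
            // Still not well-formed (or a parser setup problem): surface it to the caller.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "<xml><description>Example:Description:"
            + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        System.out.println(firstLeafText(bad, "description"));
    }
}
```

Because the escaping runs before the XML parser sees the text, the parser hands back the original characters, angle brackets included, as the element's text content. Any input that violates the stated assumptions will still fail to parse, which is preferable to silently guessing.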
