How to parse invalid (bad / not well-formed) XML?
Problem description
Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyway, I still have to try to figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...).
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
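For reference, the failure can be reproduced with a minimal sketch along the following lines. The class and method names (ReproduceParseError, tryParse) are made up for illustration, and the exact error message will depend on the parser implementation that ships with the JDK in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original question.
class ReproduceParseError {

    // Returns the parser's error message, or null if the input is well-formed.
    static String tryParse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXException e) {
            // Not well-formed input ends up here.
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("unexpected parser setup or I/O failure", e);
        }
    }

    public static void main(String[] args) {
        String bad = "<xml><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION></description></xml>";
        // The parser treats <THIS-IS-PART-OF-DESCRIPTION> as an unclosed element.
        System.out.println(tryParse(bad));
    }
}
```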
Recommended answer
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
- Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
- Standalone and C: HTML Tidy works with XML as well.
- .NET:
  - XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
  - @jdweng reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
3. Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible as what appears to be predictable often is not -- rule breaking is rarely bound by rules.
- For invalid character errors, use regex to remove/replace invalid characters:
- PHP:
preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
- Ruby:
string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
- JavaScript:
inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
- For ampersands, use regex to replace matches with &amp; (credit: blhsin):
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA sections into account.
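As a rough illustration of the text-level clean-up described above, here is a Java sketch (Java being the asker's environment). Everything named here is hypothetical: the class XmlTextRepair and its methods are not from any library. It makes several assumptions that may not hold for real data: the invalid-character class follows the XML 1.0 Char production and therefore also keeps supplementary code points, the ampersand regex accepts upper-case hex digits (a small deviation from the regex above), and <description> is taken to be an attribute-less leaf element that never nests. Like the regexes above, it ignores comments and CDATA sections.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; a sketch, not a general-purpose XML fixer.
class XmlTextRepair {

    // Matches any code point outside the XML 1.0 Char production.
    private static final Pattern INVALID_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Matches ampersands that do not begin an entity or character reference.
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:#\\d+|#x[0-9a-fA-F]+|\\w+);)");

    // Assumption: <description> is a leaf element without attributes.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "(<description>)(.*?)(</description>)", Pattern.DOTALL);

    static String stripInvalidChars(String s) {
        return INVALID_CHARS.matcher(s).replaceAll("");
    }

    static String escapeBareAmpersands(String s) {
        return BARE_AMP.matcher(s).replaceAll("&amp;");
    }

    // Escapes angle brackets found inside the text of each <description> element.
    static String escapeDescriptionText(String s) {
        Matcher m = DESCRIPTION.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String text = m.group(2).replace("<", "&lt;").replace(">", "&gt;");
            m.appendReplacement(out,
                    Matcher.quoteReplacement(m.group(1) + text + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Applies the three repairs in order: characters, ampersands, leaf content.
    static String repair(String s) {
        return escapeDescriptionText(escapeBareAmpersands(stripInvalidChars(s)));
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("input is still not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String bad = "<xml><description>Example:Description:"
                + "<THIS-IS-PART-OF-DESCRIPTION> R&D</description></xml>";
        Document doc = parse(repair(bad));
        System.out.println(
                doc.getElementsByTagName("description").item(0).getTextContent());
    }
}
```

Escaping by element name only works because the breakage is confined to one known leaf element; if the same pattern shows up elsewhere, or the element ever carries attributes, the sketch would need to be revisited.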