Dealing with malformed XML

Problem description

I'm dealing with malformed XML in Perl that's generated by an upstream process I can't change (this seems to be a common problem here). However, as far as I've seen, the XML is malformed in only one particular way: it has attribute values that contain unescaped less-than signs, e.g.:

<tag v="< 2">

I'm using Perl with XML::LibXML to parse, and this, of course, generates parse errors. I've tried the recover option, which allows me to parse, but it simply stops when it encounters the first parse error, so I'm losing data that way.

It seems like I have two general choices:

  1. Fix the input XML before parsing, possibly with regular expressions.
  2. Find a more forgiving XML parser.

I'm leaning towards option 1, as I'd still like to catch any other errors in the XML. What would you recommend? If #1, can someone guide me through the regex approach?

Recommended answer

I know this isn't the answer you want, but the XML spec is quite clear and strict.

Malformed XML is fatal.

If it doesn't work in a validator, then your code should not even attempt to 'fix' it, any more than you'd try to automatically 'fix' some program code.

From the Annotated XML Specification:

fatal error [Definition:] An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).

And specifically the commentary on why: "Draconian" error-handling

We want XML to empower programmers to write code that can be transmitted across the Web and execute on a large number of desktops. However, if this code must include error-handling for all sorts of sloppy end-user practices, it will of necessity balloon in size to the point where it, like Netscape Navigator, or Microsoft Internet Explorer, is tens of megabytes in size, thus defeating the purpose.

If you've ever tried to put together a parser for HTML, you'll realise why it needs to be this way: you end up writing so many handlers for edge cases, bad tag nesting, and implicit tag closure that your code is a mess right from the start.

And because it's my favourite post on Stack Overflow, here is an example of why: RegEx match open tags except XHTML self-contained tags

Now I appreciate this isn't always an option, and you probably wouldn't have come here if asking your upstream to 'fix your XML' was the path of least resistance. However, I would still urge you to report it as a defect in the application that generates the XML and, as much as possible, resist pressure to 'fix' it programmatically, because as you've rightly figured out, you're building yourself a world of pain when the right answer is 'fix the problem at source'.

If you are really stuck on this road, then, as Sinan Ünür points out, your only option is to trap where your parser failed, and then inspect and try to repair as you go. But you won't find an XML parser that will do it for you, because the ones that do are by definition broken.
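To make "trap where the parser failed" concrete, here is a minimal sketch in Perl. It assumes XML::LibXML is installed and that parse failures are thrown as XML::LibXML::Error objects (the usual behaviour); the helper name try_parse is hypothetical and not part of any library.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use XML::LibXML;

# Hypothetical helper: parse strictly, and on failure report where the
# parser gave up instead of letting the exception propagate.
sub try_parse {
    my ($xml) = @_;

    # load_xml dies on malformed input, so wrap it in eval.
    my $doc = eval { XML::LibXML->load_xml(string => $xml) };
    return ($doc, undef) if $doc;

    my $err = $@;

    # XML::LibXML::Error carries the position of the failure.
    if (blessed($err) && $err->isa('XML::LibXML::Error')) {
        my $where = sprintf('line %d, column %d: %s',
            $err->line, $err->column, $err->message);
        return (undef, $where);
    }

    # Anything else: fall back to the stringified error.
    return (undef, "$err");
}

my ($doc, $problem) = try_parse('<root><tag v="< 2"/></root>');
print defined $doc ? "parsed OK\n" : "parse failed at $problem\n";
```

The position only tells you where to look. Deciding what the repair should be is still up to you, which is exactly the maintenance burden described above.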

I would suggest that first you:

  • Dig out a copy of the spec to show to whoever has asked you to do this.
  • Point out to them that the whole reason we have standards is to promote interoperability.
  • Explain that by doing something that deliberately violates the standard, you are taking a business risk: you are creating code that may one day mysteriously break, because regular expressions or automatic fixing build in a set of assumptions that may not hold true.
  • Use the concept of technical debt: explain that you're incurring technical debt by automatically fixing something that's really not your problem.
  • Then ask them if they wish to accept that risk.
  • If they do think that's an acceptable risk, then just get on with it. You may find it worthwhile to effectively ignore the fact that your source data looks like XML and treat it as plain text: use regular expressions to extract the pertinent data lines, etc.
  • Stick an apology in the comments to your future maintenance programmer, explaining who made the decision and why.
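If the risk is accepted, the one defect described in the question (a bare '<' inside an attribute value) could be patched as plain text before the data reaches a strict parser. The following is only a sketch under loud assumptions: every attribute value is double-quoted, no value contains a double quote, and the sequence =" never appears outside a tag (for example in text content, comments or CDATA). The function name escape_lt_in_attributes is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical pre-processing step: escape bare '<' inside double-quoted
# attribute values. ASSUMPTIONS: values are double-quoted, contain no '"',
# and '="' never occurs outside a tag.
sub escape_lt_in_attributes {
    my ($xml) = @_;

    $xml =~ s{(=\s*")([^"]*)(")}{
        # Copy the captures first: the inner s/// resets $1, $2, $3.
        my ($open, $value, $close) = ($1, $2, $3);
        $value =~ s/</&lt;/g;
        $open . $value . $close;
    }ge;

    return $xml;
}

print escape_lt_in_attributes('<tag v="< 2">text</tag>'), "\n";
# prints: <tag v="&lt; 2">text</tag>
```

Feeding the result to a strict parser afterwards keeps every other kind of malformation fatal, so the workaround covers only the one defect it was written for.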

Also might be useful as a reference point: Which character should not be set as values in XML file
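For reference when reporting the defect upstream: inside a double-quoted attribute value, the XML spec requires '&' and '<' to be escaped, along with the quote character that delimits the value. A minimal sketch of what the producing side would need to do, with a hypothetical helper name escape_attr:

```perl
use strict;
use warnings;

# Hypothetical helper: escape a string for use inside a double-quoted
# XML attribute value.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;    # must run first, or it would re-escape the others
    $value =~ s/</&lt;/g;
    $value =~ s/"/&quot;/g;
    return $value;
}

printf qq{<tag v="%s"/>\n}, escape_attr('< 2');
# prints: <tag v="&lt; 2"/>
```

In practice a proper XML writer does this escaping for you, which is one more argument for fixing the generator rather than the consumer.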
