Dealing with malformed XML

Question

I'm dealing with malformed XML in Perl that is generated by an upstream process I can't change (this seems to be a common problem here). As far as I've seen, however, the XML is malformed in only one particular way: it has attribute values that contain unescaped less-than signs, e.g.:

<tag v="< 2">

I'm using Perl with XML::LibXML to parse, and this, of course, generates parse errors. I've tried the recover option, which lets me parse, but it simply stops at the first parse error, so I lose data that way.

It seems like I have two general choices:

  1. Fix up the input XML before parsing it, perhaps using regular expressions.
  2. Find a more forgiving XML parser.

I'm leaning towards option 1, as I'd like to catch any other errors in the XML. What would you recommend? If #1, can someone guide me through the regex approach?

Recommended answer

I know this isn't the answer you want, but the XML spec is quite clear and strict.

Malformed XML is fatal.

If it doesn't work in a validator, then your code should not even attempt to "fix" it, any more than you'd try to automatically "fix" some program code.

From the annotated XML spec:

fatal error [Definition:] An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).

And specifically the commentary on why, under the heading "Draconian" error-handling:

We want XML to empower programmers to write code that can be transmitted across the Web and execute on a large number of desktops. However, if this code must include error-handling for all sorts of sloppy end-user practices, it will of necessity balloon in size to the point where it, like Netscape Navigator, or Microsoft Internet Explorer, is tens of megabytes in size, thus defeating the purpose.

If you've ever tried to put together a parser for HTML, you'll realise why it needs to be this way: you end up writing so many handlers for edge cases, bad tag nesting, and implicit tag closure that your code is a mess right from the start.

And because it's my favourite post on Stack Overflow, here is an example of why: RegEx match open tags except XHTML self-contained tags

Now, I appreciate this isn't always an option, and you probably wouldn't have come here if asking your upstream to "fix your XML" was the path of least resistance. However, I would still urge you to report it as a defect in the application that produces the XML, and to resist as much as possible any pressure to "fix" it programmatically, because, as you've rightly figured out, you are building yourself a world of pain when the right answer is "fix the problem at source".

If you are really stuck on this road, then, as Sinan Ünür points out, your only option is to trap where your parser failed, and then inspect and try to repair as you go. But you won't find an XML parser that will do it for you, because the ones that do are by definition broken.
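Trapping the failure might look roughly like the sketch below. It assumes a reasonably recent XML::LibXML, where a failed parse dies with an XML::LibXML::Error object; the sample input is made up to reproduce the problem from the question, and what you do after locating the error is up to you.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input reproducing the unescaped '<' from the question.
my $xml = qq{<root>\n  <tag v="< 2"/>\n</root>\n};

# load_xml dies on a well-formedness error, so trap it with eval.
my $doc = eval { XML::LibXML->load_xml( string => $xml ) };

if ( my $err = $@ ) {
    if ( ref $err && $err->isa('XML::LibXML::Error') ) {
        # Structured error: tells you where the parser gave up.
        ( my $msg = $err->message ) =~ s/\s+\z//;
        printf "parse failed at line %d, column %d: %s\n",
            $err->line, $err->column, $msg;
    }
    else {
        # Older versions may die with a plain string instead.
        print "parse failed: $err\n";
    }
}
```

The reported line and column give you a place to start inspecting the raw text; they do not make the document any more trustworthy.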

As a first step, I would suggest:

  • Dig out a copy of the spec to show to whoever has asked you to do this.
  • Point out to them that the whole reason we have standards is to promote interoperability.
  • Explain that by doing something that deliberately violates the standard, you are taking a business risk: you are creating code that may one day mysteriously break, because using things like regular expressions or automatic fixing builds in a set of assumptions that may not hold true.
  • A useful concept here is technical debt: explain that you're incurring technical debt by automatic fixing, for something that's really not your problem.
  • Then ask them if they wish to accept that risk.
  • If they do think that's an acceptable risk, then just get on with it. You may find it worthwhile to, effectively, ignore the fact that your source data looks like XML and treat it as if it were plain text: use regular expressions to extract pertinent data lines, etc.
  • Stick an apology in the comments for your future maintenance programmer, explaining who made the decision and why.
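If the risk is accepted, a plain-text fix-up pass for the one defect described in the question could be sketched as follows. This is only an illustration built on exactly the kind of assumptions warned about above: it assumes attribute values are always double-quoted, never contain a literal double quote, and that nothing shaped like `="..."` appears in element text. The function name `escape_lt_in_attrs` is made up for this example.

```perl
use strict;
use warnings;

# Escape bare '<' characters inside double-quoted attribute values.
# ASSUMPTION: every ="..." sequence in the input is an attribute value.
sub escape_lt_in_attrs {
    my ($xml) = @_;
    $xml =~ s{(=\s*")([^"]*)"}{
        # Copy the captures first: the inner s/// resets $1 and $2.
        my ( $open, $value ) = ( $1, $2 );
        $value =~ s/</&lt;/g;
        $open . $value . '"';
    }ge;
    return $xml;
}

print escape_lt_in_attrs('<tag v="< 2">'), "\n";    # prints <tag v="&lt; 2">
```

Run the result through a strict parser afterwards, so that any other well-formedness error still fails loudly rather than being papered over.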

This may also be useful as a reference point: …
