Error-tolerant XML parsing in Scala


Question

I would like to parse XML that isn't necessarily well-formed. I'm looking for a fuzzy rather than a strict parser, one that can recover from badly nested tags, for example. I could write my own, but it's worth asking here first.

Update:

What I'm trying to do is extract links and other info from HTML. For well-formed XML I can use the Scala XML API. For ill-formed XML, it would be nice to convert it into correct XML somehow and deal with it the same way; otherwise I'd need two completely different sets of functions for dealing with documents.

Obviously, because the input is not well-formed and I'm trying to create a well-formed tree, some heuristic would have to be involved (for example, when you see <parent><child></parent> you would close the <child> first, and when you then see a </child> you would ignore it). But of course this isn't a proper grammar, so there's no correct way of doing it.

Recommended answer

What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
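To make the "loop through the input" idea concrete, here is a minimal sketch of such a lexer in Scala. It is not Tag Soup's actual code; the names `Event`, `Open`, `Close`, `Text` and `TagLexer` are made up for illustration, and the regular expression is deliberately simplistic (attributes are skipped, a self-closing tag is reported as an opening tag, and comments or CDATA are not recognised).

```scala
import scala.util.matching.Regex

// Hypothetical event types for this sketch; they are not part of any library.
sealed trait Event
final case class Open(name: String) extends Event
final case class Close(name: String) extends Event
final case class Text(content: String) extends Event

object TagLexer {
  // Matches <name ...> or </name ...>. Attributes are skipped, not parsed.
  private val Tag: Regex = """<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>""".r

  // Turns the input into a flat event stream. No nesting checks are done.
  def lex(input: String): List[Event] = {
    val events = List.newBuilder[Event]
    var pos = 0
    for (m <- Tag.findAllMatchIn(input)) {
      // Anything between the previous tag and this one is text.
      val text = input.substring(pos, m.start).trim
      if (text.nonEmpty) events += Text(text)
      val event: Event =
        if (m.group(1) == "/") Close(m.group(2)) else Open(m.group(2))
      events += event
      pos = m.end
    }
    // Trailing text after the last tag, if any.
    val tail = input.substring(pos).trim
    if (tail.nonEmpty) events += Text(tail)
    events.result()
  }
}
```

Calling `TagLexer.lex("<a href='x'>link</a>")` yields the events `Open("a")`, `Text("link")`, `Close("a")`. Note that badly nested input goes through just as happily, because nothing here checks that tags match.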

The problem is that a lexer is not going to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself, because there is no way that such a "lenient" parser could determine how to handle cases like the following:

<parent>
    <child>
    </parent>
</child>

Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.

Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
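For illustration, here is a rough sketch of the kind of repair heuristic the question describes: on a closing tag, first close anything still open inside it, and silently drop closing tags that match nothing. The names `Tok`, `Start`, `End` and `Repair.balance` are hypothetical, and this is only one arbitrary policy among many. A different policy would produce a different tree from the same input, which is exactly the fragility mentioned above.

```scala
// Hypothetical token types for this sketch; they are not part of any library.
sealed trait Tok
final case class Start(name: String) extends Tok
final case class End(name: String) extends Tok

object Repair {
  // Rewrites a token stream so that every Start has a properly nested End.
  def balance(tokens: List[Tok]): List[Tok] = {
    val out = List.newBuilder[Tok]
    var open: List[String] = Nil // stack of open tags, innermost first

    tokens.foreach {
      case Start(n) =>
        open = n :: open
        out += Start(n)
      case End(n) if open.contains(n) =>
        // Close everything still open inside n, then n itself.
        val (inner, rest) = open.span(_ != n)
        inner.foreach(i => out += End(i))
        out += End(n)
        open = rest.tail
      case End(_) =>
        () // stray closing tag with no matching opener: ignore it
    }

    // Close whatever is still open at the end of the input.
    open.foreach(n => out += End(n))
    out.result()
  }
}
```

Fed the badly nested example from above (`Start("parent")`, `Start("child")`, `End("parent")`, `End("child")`), this yields `Start("parent")`, `Start("child")`, `End("child")`, `End("parent")`: the child is closed early and the stray `</child>` disappears.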
