Error-tolerant XML parsing in Scala

Views: 35 Published: 2021/10/1 20:25:40 Tags: java xml scala

Problem description

I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.

Update:

What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML (somehow) and deal with it the same way, otherwise I'd have to have two completely different sets of functions for dealing with documents.

Obviously because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristic involved (such as when you see <parent><child></parent> you would close the <child> first and when you then see a <child> you ignore it). But of course this isn't a proper grammar and so there's no correct way of doing it.

Recommended answer

What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
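As a rough illustration of the "loop and match regular expressions" idea, here is a minimal sketch in plain Scala. It is not how Tag Soup is implemented; the names `Event`, `Start`, `End`, `Text`, `LenientLexer` and `Links` are made up for this example. It assumes tag names are case-insensitive, does not handle comments, CDATA or a `>` inside a quoted attribute value, and performs no nesting checks at all.

```scala
import scala.util.matching.Regex

// Hypothetical event types for this sketch; not part of any library.
sealed trait Event
final case class Start(name: String, attrs: Map[String, String]) extends Event
final case class End(name: String) extends Event
final case class Text(value: String) extends Event

object LenientLexer {
  // Alternatives, tried in order: end tag, start tag, run of text, stray '<'.
  private val Token: Regex =
    """</\s*([A-Za-z][\w:.-]*)\s*>|<([A-Za-z][\w:.-]*)([^>]*)>|([^<]+)|(<)""".r

  // name="value", name='value' or name=value
  private val Attr: Regex =
    """([A-Za-z_:][\w:.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>/]+))""".r

  // Turns the input into a flat event stream; every character ends up in some event.
  def lex(input: String): List[Event] =
    Token.findAllMatchIn(input).map { m =>
      if (m.group(1) != null) End(m.group(1).toLowerCase)
      else if (m.group(2) != null) Start(m.group(2).toLowerCase, attrs(m.group(3)))
      else if (m.group(4) != null) Text(m.group(4))
      else Text("<") // a '<' that does not start a recognisable tag is kept as text
    }.toList

  private def attrs(raw: String): Map[String, String] =
    Attr.findAllMatchIn(raw).map { a =>
      // Exactly one of the three value groups matched; the others are null.
      val value = Seq(a.group(2), a.group(3), a.group(4)).find(_ != null).getOrElse("")
      a.group(1).toLowerCase -> value
    }.toMap
}

object Links {
  // Collects href values of <a> start tags, in document order, ignoring nesting.
  def hrefs(html: String): List[String] =
    LenientLexer.lex(html).collect { case Start("a", as) => as.get("href") }.flatten
}
```

For something like link extraction this may already be enough, since `Links.hrefs` only looks at start tags and never needs a tree.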

The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:

<parent>
    <child>
    </parent>
</child>

Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.

Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
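To make the fragility concrete, here is a minimal sketch of one possible repair heuristic, the one mentioned in the question: when an end tag arrives, close any still-open children first, and ignore end tags that match nothing open. The types `Tag`, `Open`, `Close`, `Node` and the object `Repair` are hypothetical names for this example, and the choice of heuristic is an assumption, not a standard. Other heuristics would yield different trees for the same input.

```scala
// Hypothetical token and tree types for this sketch.
sealed trait Tag
final case class Open(name: String) extends Tag
final case class Close(name: String) extends Tag

final case class Node(name: String, children: List[Node])

object Repair {
  // An element that is still open, with the children seen so far (newest first).
  private final case class Frame(name: String, kids: List[Node])

  // Pops the innermost frame and attaches it to its parent as a finished node.
  private def closeTop(stack: List[Frame]): List[Frame] = stack match {
    case top :: parent :: rest =>
      parent.copy(kids = Node(top.name, top.kids.reverse) :: parent.kids) :: rest
    case _ => stack // only the synthetic root is left; nothing to close
  }

  // Builds a well-formed forest from possibly mis-nested tags.
  def build(tags: List[Tag]): List[Node] = {
    val start = List(Frame("#root", Nil)) // synthetic root at the bottom of the stack
    val afterInput = tags.foldLeft(start) {
      case (stack, Open(name)) => Frame(name, Nil) :: stack
      case (stack, Close(name)) =>
        // Look for the innermost open element with this name (root excluded).
        val depth = stack.init.indexWhere(_.name == name)
        if (depth < 0) stack // nothing to close: ignore the stray end tag
        else (0 to depth).foldLeft(stack)((s, _) => closeTop(s)) // children first, then the match
    }
    // Implicitly close whatever is still open at end of input.
    val rootOnly = (1 until afterInput.length).foldLeft(afterInput)((s, _) => closeTop(s))
    rootOnly.head.kids.reverse
  }
}
```

Fed the mis-nested example from above, `Repair.build(List(Open("parent"), Open("child"), Close("parent"), Close("child")))` yields a `parent` node containing one empty `child`; the stray `</child>` is silently dropped. That is one defensible answer among several, which is exactly the problem.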
