Writing an HTML Parser

Views: 47 Published: 2021/5/14 19:35:37 Tags: html, parsing, html-parsing

Question

I am currently attempting (or planning to attempt) to write a program, as simple as possible, that parses an HTML document into a tree.

After googling, I have found many answers saying "don't do it, it's been done" (or words to that effect), references to examples of HTML parsers, and a rather emphatic article on why one shouldn't use regular expressions. However, I haven't found any guides on the "right" way to write a parser. (This, by the way, is something I'm attempting more as a learning exercise than anything, so I'd quite like to do it myself rather than use a premade one.)

I believe I could make a working XML parser just by reading the document and adding the tags, text, etc. to the tree, stepping up a level whenever I hit a close tag (again, simple; no fancy threading or efficiency required at this stage). However, in HTML not all tags are closed.
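To make the XML-style idea concrete, here is a minimal sketch in Python. The names `Node` and `parse_strict` are made up for illustration, attributes are simply discarded, and the scanner assumes no `>` appears inside an attribute value, so treat it as a starting point rather than a real XML parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)  # child Nodes or text strings


def parse_strict(source: str) -> Node:
    """Build a tree, stepping down on an open tag and up on a close tag."""
    root = Node("#document")
    stack = [root]  # stack[-1] is the element currently being filled
    pos = 0
    while pos < len(source):
        lt = source.find("<", pos)
        text = source[pos:] if lt == -1 else source[pos:lt]
        if text.strip():
            stack[-1].children.append(text)
        if lt == -1:
            break
        gt = source.find(">", lt)
        if gt == -1:
            raise ValueError("unterminated tag")
        tag = source[lt + 1:gt].strip()
        if not tag:
            raise ValueError("empty tag")
        if tag.startswith("/"):
            # Close tag: it must match the innermost open element.
            name = tag[1:].strip().lower()
            if len(stack) == 1 or stack[-1].name != name:
                raise ValueError(f"unexpected close tag: {name!r}")
            stack.pop()
        else:
            self_closing = tag.endswith("/")
            name = tag.rstrip("/").split()[0].lower()  # drop attributes
            node = Node(name)
            stack[-1].children.append(node)
            if not self_closing:
                stack.append(node)
        pos = gt + 1
    if len(stack) != 1:
        raise ValueError(f"unclosed tag: {stack[-1].name!r}")
    return root
```

Feeding this something like `<p>open` raises an error, which is exactly the point where HTML needs extra rules.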

So my question is this: what would you recommend as a way of dealing with this? The only idea I've had is to treat it in a similar way to XML, but keep a list of tags that aren't necessarily closed, each with its conditions for closure (e.g. ends on or next tag).
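The "list of tags with closure conditions" idea could look roughly like the sketch below. The tables `VOID` and `CLOSED_BY` are a small illustrative subset chosen for this example, not the full rule set of any HTML standard, and `build_tree` is a hypothetical helper that works on already-tokenized input so the closing logic stays visible.

```python
# Elements that never take a close tag (illustrative subset).
VOID = frozenset({"br", "hr", "img", "input", "link", "meta"})

# For an open element (key), the opening tags that implicitly close it.
# Illustrative subset only; a real parser needs many more entries.
CLOSED_BY = {
    "p": {"p", "div", "ul", "ol", "table", "h1", "h2", "h3"},
    "li": {"li"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
    "tr": {"tr"},
    "td": {"td", "th", "tr"},
    "th": {"td", "th", "tr"},
    "option": {"option"},
}


def build_tree(tokens):
    """tokens: iterable of ("open" | "close" | "text", value) pairs."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "open":
            # Pop every open element that this new tag implicitly closes.
            while len(stack) > 1 and value in CLOSED_BY.get(stack[-1]["tag"], ()):
                stack.pop()
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            if value not in VOID:
                stack.append(node)
        elif kind == "close":
            if value in [n["tag"] for n in stack[1:]]:
                # Close everything up to and including the matching element.
                while stack.pop()["tag"] != value:
                    pass
            # Otherwise it is a stray close tag and is ignored.
    return root


tokens = [("open", "ul"), ("open", "li"), ("text", "one"),
          ("open", "li"), ("text", "two"), ("open", "br"), ("close", "ul")]
tree = build_tree(tokens)
```

Here the second `li` closes the first one, `br` never goes on the stack, and the final `ul` close tag also closes the `li` that was still open.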

Does anyone have any other (hopefully better) suggestions? Is there a better way of doing this altogether?

Recommended answer

So, I'll try for an answer here:

Basically, what makes "plain" HTML parsing (not talking about valid XHTML here) different from XML parsing is loads of rules like never-closed <img> tags, or, strictly speaking, the fact that even the sloppiest HTML markup will still render somehow in a browser. You will need a validator along with the parser to build your tree. But you'll have to decide on an HTML standard you want to support, so that when you come across a weakness in the markup, you'll know it's an error and not just sloppy HTML.

Know all the rules, build a validator, and then you'll be able to build a parser. That's Plan A.

Plan B would be to allow for a certain error resistance in your parser, which would make the validation step unnecessary. For example, parse all the tags and put them in a list, omitting any attributes, so that you can easily operate on the list and determine whether a tag was left open or was never opened at all. You eventually get a "good" layout tree, which will be an approximate solution for sloppy layout while being exact for correct layout.
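One possible reading of Plan B, sketched in Python: first flatten the markup into a list of tag names with attributes dropped, then walk the list with a stack to find tags that were left open or never opened. The function names `extract_tags` and `repair` and the short `void` set are assumptions made for this example, and the scanner is naive (a `>` inside a comment or attribute value would confuse it).

```python
def extract_tags(markup):
    """Return tag names in document order; close tags keep a leading '/'."""
    tags = []
    pos = 0
    while True:
        lt = markup.find("<", pos)
        if lt == -1:
            break
        gt = markup.find(">", lt + 1)
        if gt == -1:
            break
        body = markup[lt + 1:gt].strip()
        pos = gt + 1
        if not body or body[0] in "!?":  # skip comments, doctype, etc.
            continue
        parts = body.strip("/").split()
        if not parts:
            continue
        name = parts[0].lower()  # everything after the name is attributes
        tags.append("/" + name if body.startswith("/") else name)
    return tags


def repair(tags, void=frozenset({"br", "hr", "img", "input", "link", "meta"})):
    """Return (balanced, left_open, never_opened) for a flat tag list."""
    balanced, stack, left_open, never_opened = [], [], [], []
    for tag in tags:
        if not tag.startswith("/"):
            balanced.append(tag)
            if tag not in void:  # void tags never wait for a close tag
                stack.append(tag)
        elif tag[1:] in stack:
            # Insert close tags for anything still open inside this element.
            while stack[-1] != tag[1:]:
                top = stack.pop()
                left_open.append(top)
                balanced.append("/" + top)
            stack.pop()
            balanced.append(tag)
        else:
            never_opened.append(tag[1:])  # stray close tag: drop it
    while stack:  # close whatever is still open at the end of the document
        top = stack.pop()
        left_open.append(top)
        balanced.append("/" + top)
    return balanced, left_open, never_opened


tags = extract_tags('<div class="x"><p>one<img src="a.png"><b>two</p></i></div>')
balanced, left_open, never_opened = repair(tags)
```

For well-formed input `balanced` equals the input list and both report lists are empty; for sloppy input it is only a guess at what the author meant, which matches the "approximate for sloppy, exact for correct" trade-off described above.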

Hope that helps!
