Writing an HTML Parser

Question

I am planning to write a program, as simple as possible, that parses an HTML document into a tree.

After googling, I found many answers saying "don't do it, it's been done" (or words to that effect), references to existing HTML parsers, and a rather emphatic article on why one shouldn't use regular expressions. However, I haven't found any guide on the "right" way to write a parser. (This is more a learning exercise than anything else, so I'd like to write it myself rather than use a premade one.)

I believe I could make a working XML parser just by reading the document and adding the tags, text, etc. to the tree, stepping up a level whenever I hit a close tag (again: simple, with no fancy threading or efficiency required at this stage). In HTML, however, not all tags are closed.
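The "step up a level on a close tag" idea can be sketched with a stack of open elements. This is only a minimal illustration, assuming well-formed input with no comments, doctype, or `>` inside attribute values; the names `Node` and `parse_strict` are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                      # element name, or "#text" for text nodes
    text: str = ""
    children: list = field(default_factory=list)


def parse_strict(source):
    """Build a tree from markup in which every tag is explicitly closed."""
    root = Node("#root")
    stack = [root]                # stack[-1] is the innermost open element
    pos = 0
    while pos < len(source):
        if source[pos] != "<":
            # Text runs until the next "<" or the end of the input.
            end = source.find("<", pos)
            if end == -1:
                end = len(source)
            text = source[pos:end]
            if text.strip():
                stack[-1].children.append(Node("#text", text))
            pos = end
            continue
        gt = source.find(">", pos)
        if gt == -1:
            raise ValueError("unterminated tag")
        body = source[pos + 1:gt].strip()
        if body.startswith("/"):
            # Close tag: it must match the innermost open element.
            name = body[1:].strip().lower()
            if len(stack) == 1 or stack[-1].tag != name:
                raise ValueError(f"unexpected </{name}>")
            stack.pop()           # step up one level
        else:
            parts = body.rstrip("/").split()
            if not parts:
                raise ValueError("empty tag")
            node = Node(parts[0].lower())   # attributes are ignored here
            stack[-1].children.append(node)
            if not body.endswith("/"):      # <x/> is not left open
                stack.append(node)          # step down one level
        pos = gt + 1
    if len(stack) != 1:
        raise ValueError(f"unclosed <{stack[-1].tag}>")
    return root
```

Calling `parse_strict("<ul><li>one</li><li>two</li></ul>")` returns a root whose single child is the `ul` node with two `li` children. Feeding it `<p>one<p>two` raises an error, which is exactly the case the question is about.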

So my question is: what would you recommend as a way of dealing with this? The only idea I've had is to treat HTML much like XML, but to keep a list of tags that aren't necessarily closed, each with its conditions for closure (e.g. <p> ends at </p> or at the next <p> tag).
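One way to picture the "list of tags with conditions for closure" is a lookup table consulted before each start tag is pushed. The sketch below is an assumption-laden illustration, not a full algorithm: the table `CLOSED_BY_START` and the set `VOID` are small hand-picked subsets, `build_tree` is a hypothetical name, and the input is assumed to be already split into `(kind, value)` tokens.

```python
# Illustrative subset only: an open element named by the key is closed
# implicitly when a start tag named in the value set arrives.
CLOSED_BY_START = {
    "p": {"p", "div", "ul", "ol", "table", "h1", "h2"},
    "li": {"li"},
    "td": {"td", "tr"},
    "tr": {"tr"},
}

# Elements that never take an end tag (again, only a subset).
VOID = {"img", "br", "hr", "input", "meta", "link"}


def build_tree(tokens):
    """tokens: iterable of ("open" | "close" | "text", value) pairs."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "open":
            # Close the innermost element(s) whose end this start tag implies.
            while value in CLOSED_BY_START.get(stack[-1]["tag"], ()):
                stack.pop()
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            if value not in VOID:
                stack.append(node)
        elif kind == "close":
            # Pop up to the nearest matching open element;
            # an end tag with no open match is silently ignored.
            if value in [n["tag"] for n in stack[1:]]:
                while stack.pop()["tag"] != value:
                    pass
    return root
```

With the tokens for `<p>one<p>two<img></p>`, the two paragraphs end up as siblings under the root and the `img` node sits inside the second paragraph without being left open.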

Does anyone have any other (hopefully better) suggestions? Is there a better way of doing this altogether?

Recommended answer

So, I'll try for an answer here:

Basically, what makes "plain" HTML parsing (not talking about valid XHTML here) different from XML parsing is the load of special rules, such as tags that are never closed, like <img>, or, strictly speaking, the fact that even the sloppiest HTML markup will still render somehow in a browser. You will need a validator along with the parser to build your tree. But you'll have to decide which HTML standard you want to support, so that when you come across a flaw in the markup, you'll know it's an error and not just sloppy HTML.

Know all the rules, build a validator, and then you'll be able to build a parser. That's Plan A.

Plan B would be to build a certain error tolerance into your parser, which would make the validation step unnecessary. For example, parse all the tags and put them in a list, omitting any attributes, so that you can easily operate on the list and determine whether a tag was left open or was never opened at all. You eventually get a "good" layout tree, which is an approximate solution for sloppy markup while being exact for correct markup.
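Plan B might look roughly like the sketch below: collect the tag names into a flat list, then walk the list with a stack to find tags that were left open and end tags that were never opened. The function names, the `VOID` subset, and the regular expression are assumptions made for this example. The regex is used only to pick out tag names, not to match nesting, and it would be fooled by a `>` inside an attribute value.

```python
import re

# Matches start and end tags; attributes are matched but thrown away.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Subset of elements that never have an end tag, so they are never "left open".
VOID = {"img", "br", "hr", "input", "meta", "link"}


def tag_list(html):
    """Return [(name, is_close), ...] in document order, attributes omitted."""
    return [(m.group(2).lower(), m.group(1) == "/")
            for m in TAG_RE.finditer(html)]


def find_problems(tags):
    """Return (left_open, never_opened) as two lists of tag names."""
    stack, left_open, never_opened = [], [], []
    for name, is_close in tags:
        if not is_close:
            if name not in VOID:
                stack.append(name)
        elif name in stack:
            # Everything opened after the matching start tag was left open.
            while True:
                top = stack.pop()
                if top == name:
                    break
                left_open.append(top)
        else:
            never_opened.append(name)
    # Whatever is still on the stack at the end was never closed.
    left_open.extend(reversed(stack))
    return left_open, never_opened
```

For `<div class="a"><img src="x.png"><p>one<p>two</b></div><span>`, this reports both `p` elements and the `span` as left open and `b` as never opened. A repair pass could then insert the missing end tags and drop the stray one before building the tree.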

Hope that helps!
