How does a parser (for example, HTML) work?

Question

For argument's sake, let's assume an HTML parser.

I've read that it tokenizes everything first, and then parses it.

What does tokenize mean?

Does the parser read each character in turn, building up a multidimensional array to store the structure?

For example, does it read a < and then begin to capture the element, and then, once it meets a closing > (outside of an attribute), push it onto an array stack somewhere?

I'm interested for the sake of knowing (I'm curious).

If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed?

Recommended answer

First of all, you should be aware that parsing HTML is particularly ugly -- HTML was in wide (and divergent) use before being standardized. This leads to all manner of ugliness, such as the standard specifying that some constructs aren't allowed, but then specifying required behavior for those constructs anyway.

Getting to your direct question: tokenization is roughly equivalent to taking English, and breaking it up into words. In English, most words are consecutive streams of letters, possibly including an apostrophe, hyphen, etc. Mostly words are surrounded by spaces, but a period, question mark, exclamation point, etc., can also signal the end of a word. Likewise for HTML (or whatever) you specify some rules about what can make up a token (word) in this language. The piece of code that breaks the input up into tokens is normally known as the lexer.
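To make the idea of "rules about what can make up a token" concrete, here is a minimal sketch in Python. The three token kinds (`open_tag`, `close_tag`, `text`) and the regular expression are assumptions made up for illustration; they cover a tiny HTML-like subset and are nowhere near what a real HTML tokenizer has to handle. Note how quoted attribute values are matched as a unit, so a > inside an attribute does not end the tag.

```python
import re

# Hypothetical token rules for a tiny HTML-like language (not the real HTML rules).
TOKEN_RE = re.compile(r"""
    (?P<close_tag></\s*[A-Za-z][A-Za-z0-9]*\s*>)
  | (?P<open_tag><[A-Za-z][A-Za-z0-9]*(?:"[^"]*"|'[^']*'|[^'">])*>)
  | (?P<text>[^<]+)
""", re.VERBOSE)


def tokenize(source):
    """Return a list of (kind, text) pairs covering the whole input."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            # A '<' that starts no recognizable tag is kept as one character of text.
            tokens.append(("text", source[pos]))
            pos += 1
            continue
        # lastgroup is the name of the alternative that matched.
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For instance, `tokenize('<p class="a>b">Hi</p>')` yields an `open_tag` token for the whole `<p class="a>b">`, a `text` token for `Hi`, and a `close_tag` token for `</p>`.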

At least in a normal case, you do not break all the input up into tokens before you start parsing. Rather, the parser calls the lexer to get the next token when it needs one. When it's called, the lexer looks at enough of the input to find one token, delivers that to the parser, and no more of the input is tokenized until the next time the parser needs more input.
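One way to picture this pull-style arrangement is with a Python generator, which only does work when the caller asks for the next item. This is a sketch under simplifying assumptions: the "tokens" are just whitespace-separated words, and the `log` list and the `take_two` helper are hypothetical names added only to show how much input has actually been scanned.

```python
def lex(source, log):
    """Yield one whitespace-separated word at a time, recording each in log."""
    pos = 0
    while pos < len(source):
        # Skip whitespace before the next word.
        while pos < len(source) and source[pos].isspace():
            pos += 1
        start = pos
        # Scan to the end of the word.
        while pos < len(source) and not source[pos].isspace():
            pos += 1
        if start < pos:
            log.append(source[start:pos])
            yield source[start:pos]


def take_two(source):
    """Stand-in for a parser: pull exactly two tokens, then stop."""
    log = []
    tokens = lex(source, log)
    first = next(tokens)
    second = next(tokens)
    return first, second, log
```

Calling `take_two("a b c d")` returns the first two words, and the log contains only those two: the rest of the input has not been tokenized because nobody asked for it.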

In a general way, you're right about how a parser works. A typical parser does use a stack while it parses a statement, but what it builds to represent that statement is normally a tree (an Abstract Syntax Tree, aka AST), not a multidimensional array.
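The split between "stack while parsing" and "tree as the result" can be sketched as follows. The token shapes (`("open", name)`, `("text", s)`, `("close", name)`) and the dictionary-based nodes are assumptions chosen for brevity. A real HTML parser would also recover from mismatched tags instead of raising an error, which is a large part of why HTML parsing is ugly.

```python
def build_tree(tokens):
    """Build a nested tree from ("open", name), ("text", s), ("close", name) tokens."""
    root = {"tag": "#root", "children": []}
    stack = [root]  # the elements that are currently open, innermost last
    for kind, value in tokens:
        if kind == "open":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)  # attach to the innermost open element
            stack.append(node)                  # and make it the new innermost one
        elif kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "close":
            if len(stack) == 1 or stack[-1]["tag"] != value:
                raise ValueError(f"unexpected closing tag: {value}")
            stack.pop()
    if len(stack) != 1:
        raise ValueError(f"unclosed tag: {stack[-1]['tag']}")
    return root
```

The stack exists only during parsing and is back down to just the root at the end; the nested `children` lists are what survive as the tree.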

Based on the complexity of parsing HTML, I'd reserve looking at a parser for it until you've read through a few others first. If you do some looking around, you should be able to find a fair number of parsers/lexers for things like mathematical expressions that are probably more suitable as an introduction (smaller, simpler, easier to understand, etc.).
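As one possible starting point of that kind, here is a small lexer and recursive-descent parser for arithmetic on integers. The grammar, the tuple-shaped AST nodes `(op, left, right)`, and all function names are choices made for this sketch, not taken from any particular library. The parser pulls tokens from the lexer one at a time and builds a tree, which a separate function then evaluates.

```python
import operator
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\S))")
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def lex(source):
    """Lazily yield ("num", int) or ("op", char) tokens, then ("end", None)."""
    for match in TOKEN_RE.finditer(source):
        number, op = match.groups()
        yield ("num", int(number)) if number is not None else ("op", op)
    yield ("end", None)


class Parser:
    # Grammar: expr   := term (('+' | '-') term)*
    #          term   := factor (('*' | '/') factor)*
    #          factor := NUMBER | '(' expr ')'
    def __init__(self, source):
        self.tokens = lex(source)
        self.current = next(self.tokens)  # one token of lookahead

    def advance(self):
        token = self.current
        self.current = next(self.tokens)  # ask the lexer for one more token
        return token

    def expr(self):
        node = self.term()
        while self.current in (("op", "+"), ("op", "-")):
            op = self.advance()[1]
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.current in (("op", "*"), ("op", "/")):
            op = self.advance()[1]
            node = (op, node, self.factor())
        return node

    def factor(self):
        kind, value = self.current
        if kind == "num":
            self.advance()
            return value
        if (kind, value) == ("op", "("):
            self.advance()
            node = self.expr()
            if self.current != ("op", ")"):
                raise SyntaxError("expected ')'")
            self.advance()
            return node
        raise SyntaxError(f"unexpected token: {value!r}")


def parse(source):
    """Return the AST: an int leaf or an (op, left, right) tuple."""
    parser = Parser(source)
    tree = parser.expr()
    if parser.current != ("end", None):
        raise SyntaxError(f"unexpected token: {parser.current[1]!r}")
    return tree


def evaluate(node):
    """Walk the AST and compute its value."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))
```

With this sketch, `parse("1 + 2 * (3 - 1)")` produces `("+", 1, ("*", 2, ("-", 3, 1)))`, and `evaluate` on that tree gives 5. Operator precedence falls out of the grammar: `term` binds tighter than `expr` simply because `expr` is defined in terms of it.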
