lexers vs parsers


Question

Are lexers and parsers really that different in theory?

It seems fashionable to hate regular expressions: coding horror, another blog post.

However, popular lexing-based tools such as pygments, geshi, and prettify all use regular expressions. They seem to lex anything...

When is lexing enough, and when do you need EBNF?

Has anyone used the tokens produced by these lexers with the bison or antlr parser generators?

Answer

What parsers and lexers have in common:

  1. They read symbols of some alphabet from their input.

    • Hint: the alphabet doesn't necessarily have to consist of letters, but it must consist of symbols that are atomic for the language the parser/lexer understands.
    • Symbols for the lexer: ASCII characters.
    • Symbols for the parser: particular tokens, which are the terminal symbols of its grammar.
  • This is usually where the real difference lies; see below for more.
  • Grammar understood by lexers: a regular grammar (Chomsky level 3).
  • Grammar understood by parsers: a context-free grammar (Chomsky level 2).
  • Lexers attach meaning by classifying lexemes (strings of symbols from the input) as particular tokens. E.g. all these lexemes: *, ==, <=, ^ will be classified as the "operator" token by a C/C++ lexer.
  • Parsers attach meaning by classifying strings of tokens from the input (sentences) as particular nonterminals and building the parse tree. E.g. all these token strings: [number][operator][number], [id][operator][id], [id][operator][number][operator][number] will be classified as the "expression" nonterminal by a C/C++ parser.
  • When the lexer recognizes a character sequence constituting a proper number, it can convert it to its binary value and store it with the "number" token.
  • Similarly, when the parser recognizes an expression, it can compute its value and store it with the "expression" node of the syntax tree.
  • Lexers produce tokens, which are sentences of the regular language they recognize. Each token can have an inner syntax (though level 3, not level 2), but that doesn't matter for the output data or for whatever reads it.
  • Parsers produce syntax trees, which are representations of sentences of the context-free language they recognize. Usually it's just one big tree for the whole document/source file, because the whole document/source file is a proper sentence for them. But there's no reason why a parser couldn't produce a series of syntax trees on its output. E.g. it could be a parser which recognizes SGML tags stuck into plain text. So it would tokenize the SGML document into a series of tokens: [TXT][TAG][TAG][TXT][TAG][TXT]....
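To make the lexer side concrete, here is a minimal sketch of a regex-based lexer that classifies lexemes into tokens, as described above. The token names and patterns are my own illustrative choices, not from the original answer:

```python
import re

# Hypothetical token specification: each token class is a regular
# expression (level 3), listed in priority order.
TOKEN_SPEC = [
    ("number",   r"\d+"),
    ("id",       r"[A-Za-z_]\w*"),
    ("operator", r"==|<=|[*^+\-]"),
    ("skip",     r"\s+"),            # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Yield (token_class, lexeme) pairs.

    Note how distinct lexemes like +, == and <= all collapse into the
    single "operator" class: that classification is the lexer's job.
    """
    for m in MASTER.finditer(text):
        if m.lastgroup != "skip":
            yield (m.lastgroup, m.group())

print(list(lex("x + 3 == y")))
# [('id', 'x'), ('operator', '+'), ('number', '3'), ('operator', '=='), ('id', 'y')]
```

The resulting [id][operator][number][operator][id] stream is exactly the kind of input a parser would then classify as an "expression".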

As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for another parser, which reads its input tokens as symbols from its own alphabet (tokens are simply symbols of some alphabet), in the same way that sentences from one language can be alphabetic symbols of some other, higher-level language. For example, if * and - are the symbols of the alphabet M (as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and lines as letters encoded in Morse code. The sentences in the language "Morse Code" could be tokens for some other parser, for which these tokens are atomic symbols of its language (e.g. the "English Words" language). And these "English Words" could be tokens (symbols of the alphabet) for some higher-level parser which understands the "English Sentences" language. All these languages differ only in the complexity of the grammar. Nothing more.
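The Morse-code layering above can be sketched directly: one layer recognizes dot/dash strings as letters, and those letters then become the atomic symbols of the next layer (words). The code table holds standard Morse codes; the function names and word separator convention are my own:

```python
# Standard Morse codes for a handful of letters. Each run of
# dots/dashes is a "sentence" at the symbol level and a single
# token (letter) at the next level up.
MORSE = {".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E",
         "....": "H", "..": "I", ".-..": "L", "---": "O",
         "...": "S", "-": "T"}

def decode_letters(code):
    """Layer 1: recognize dot/dash strings (separated by spaces) as letters."""
    return "".join(MORSE[sym] for sym in code.split())

def decode_words(code):
    """Layer 2: the letters from layer 1 are the alphabet of the
    'words' language; ' / ' separates words (an assumed convention)."""
    return " ".join(decode_letters(w) for w in code.split(" / "))

print(decode_words(".... . .-.. .-.. --- / ... --- ..."))  # HELLO SOS
```

Each layer reads the previous layer's output as its own atomic symbols, which is exactly the lexer-feeds-parser relationship.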

So what is all this about "Chomsky's grammar levels"? Well, Noam Chomsky classified grammars into four levels depending on their complexity:

  • Level 3: Regular grammars

    They use regular expressions, that is, they can consist only of symbols of the alphabet (a, b), their concatenations (ab, aba, bbb etc.), or alternatives (e.g. a|b).
    They can be implemented as finite state automata (FSA), such as an NFA (Nondeterministic Finite Automaton) or, better, a DFA (Deterministic Finite Automaton).
    Regular grammars can't handle nested syntax, e.g. properly nested/matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks etc. That's because a state automaton dealing with it would have to have infinitely many states to handle infinitely many nesting levels.
  • Level 2: Context-free grammars

    They can have nested, recursive, self-similar branches in their syntax trees, so they handle nested structures well.
    They can be implemented as a state automaton with a stack. This stack is used to represent the nesting level of the syntax. In practice, they're usually implemented as top-down, recursive-descent parsers, which use the machine's procedure call stack to track the nesting level and use recursively called procedures/functions for every non-terminal symbol in the grammar.
    But they can't handle context-sensitive syntax, e.g. when you have an expression x+3 and in one context this x could be the name of a variable while in another context it could be the name of a function, etc.
  • Level 1: Context-sensitive grammars

  • Level 0: Unrestricted grammars
    Also called recursively enumerable grammars.
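The level-2 vs level-3 point can be illustrated with a minimal recursive-descent recognizer for balanced parentheses, a classic non-regular language. The grammar (S -> '(' S ')' S | epsilon) and the function names are my own sketch; the recursion depth plays the role of the stack that a finite automaton lacks:

```python
def balanced(s):
    """Recognize properly nested parentheses using the grammar
       S -> '(' S ')' S | epsilon
    Each call to parse_S is one use of the 'stack' (here, the
    procedure call stack), which is why a context-free recognizer
    can do what a finite automaton cannot."""
    def parse_S(i):
        # Alternative 1: S -> '(' S ')' S
        if i < len(s) and s[i] == "(":
            j = parse_S(i + 1)                      # inner S
            if j is None or j >= len(s) or s[j] != ")":
                return None                         # missing ')'
            return parse_S(j + 1)                   # trailing S
        # Alternative 2: S -> epsilon (consume nothing)
        return i

    return parse_S(0) == len(s)

print(balanced("(()()(()()))"))  # True
print(balanced("(()"))           # False
```

Trying to write a single regular expression for this language fails for exactly the reason stated above: each additional nesting level would need another state.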

