令牌和词素有什么区别? [英] What is the difference between a token and a lexeme?

查看:97
本文介绍了令牌和词素有什么区别?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在Aho Ullman和Sethi的Compiler Construction中,假设源程序的输入字符串被划分为具有逻辑含义的字符序列,称为记号,而lexemes是组成该序列的序列。令牌,那么基本区别是什么?

解决方案

使用 编译器原理,技术和工具,第二版。 (WorldCat),由Aho,Lam,Sethi和Ullman(又名紫色龙书



Lexeme pg。 111


Lexeme是源程序中的一系列字符,其中
与令牌的模式匹配,并由词法
分析器作为该令牌的实例。


令牌pg。 111


令牌是一对,由令牌名称和可选属性
值组成。令牌名称是一个抽象符号,代表一种
词法单元,例如特定的关键字,或表示标识符的输入
字符序列。令牌名称是解析器处理的输入
符号。


模式pg。 111


模式是对令牌的词位可以
采取的形式的描述。在使用关键字作为标记的情况下,模式只是构成关键字的
个字符序列。对于标识符和其他
个令牌,该模式是更复杂的结构,由
个许多字符串匹配。


图3.2:令牌示例pg.112

  [令牌] [非正式描述] [样本词位] 
如果是字符i,f如果
个其他字符e,l,s,e else
比较<或>或< =或> =或==或!=< =,!=
id字母,后跟字母和数字pi,score,D2
任意数字常数3.14159、0、6.02e23
字面意义除外,但以的 core dumped包围着

为了更好地理解这种与词法分析器和解析器的关系,我们将从解析器开始,然后向后处理到输入。



为了简化解析器的设计,解析器无法与直接输入,但接受由词法分析器生成的标记的列表。查看图3.2中的令牌列,我们看到诸如 if else comparison之类的令牌 id 编号文字;这些是令牌的名称。通常,在词法分析器/解析器中,令牌是一种结构,它不仅保存令牌的名称,还保存组成令牌的字符/符号以及组成令牌的字符串的开始和结束位置,并带有起始位置和结束位置用于错误报告,突出显示等。



现在,词法分析器接受字符/符号的输入,并使用词法分析器的规则转换输入的字符/将符号符号化为令牌。现在,使用词法分析器/解析器的人对自己经常使用的东西有了自己的说法。您认为构成标记的字符/符号序列就是使用词法分析器/解析器的人所说的lexeme。因此,当您看到lexeme时,只需考虑代表令牌的一系列字符/符号。在比较示例中,字符/符号的顺序可以是不同的模式,例如< > else 3.14 等。



另一种思考方式两者之间的关系是令牌是解析器使用的一种编程结构,它具有一个称为lexeme的属性,该属性保存来自输入的字符。现在,如果您在代码中查看令牌的大多数定义,您可能不会将lexeme视为令牌的属性之一。这是因为令牌很可能会保留代表令牌和词位的字符/符号的开始和结束位置,由于输入是静态的,因此可以根据需要从开始和结束位置派生字符/符号的序列。 / p>

In Compiler Construction by Aho Ullman and Sethi, it is given that the input string of characters of the source program are divided into sequence of characters that have a logical meaning, and are known as tokens and lexemes are sequences that make up the token so what is the basic difference?

解决方案

Using "Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, AKA the Purple Dragon Book,

Lexeme pg. 111

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Token pg. 111

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes.

Pattern pg. 111

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is more complex structure that is matched by many strings.

Figure 3.2: Examplesof tokens pg.112

[Token]       [Informal Description]                  [Sample Lexemes]
if            characters i, f                         if
else          characters e, l, s, e                   else
comparison    < or > or <= or >= or == or !=          <=, !=
id            letter followed by letters and digits   pi, score, D2
number        any numeric constant                    3.14159, 0, 6.02e23
literal       anything but ", surrounded by "'s       "core dumped"

To better understand this relation to a lexer and parser we will start with the parser and work backwards to the input.

To make it easier to design a parser, a parser does not work with the input directly but takes in a list of tokens generated by a lexer. Looking at the token column in Figure 3.2 we see tokens such as if, else, comparison, id, number and literal; these are names of tokens. Typically with a lexer/parser a token is a structure that holds not only the name of the token, but the characters/symbols that make up the token and the start and end position of the string of characters that make up the token, with the start and end position being used for error reporting, highlighting, etc.

Now the lexer takes the input of characters/symbols and using the rules of the lexer converts the input characters/symbols into tokens. Now people who work with lexer/parser have their own words for things they use often. What you think of as a sequence of characters/symbols that make up a token are what people who use lexer/parsers call lexeme. So when you see lexeme, just think of a sequence of characters/symbols representing a token. In the comparison example, the sequence of characters/symbols can be different patterns such as < or > or else or 3.14, etc.

Another way to think of the relation between the two is that a token is a programming structure used by the parser that has a property called lexeme that holds the character/symbols from the input. Now if you look at most definitions of token in code you may not see lexeme as one of the properties of the token. This is because a token will more likely hold the start and end position of the characters/symbols that represent the token and the lexeme, sequence of characters/symbols can be derived from the start and end position as needed because the input is static.

这篇关于令牌和词素有什么区别?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆