Output of Lexer


Problem description

    I am currently writing a compiler and I'm in the Lexer phase.

    I know that the lexer tokenizes the input stream.

    However, consider the following stream:

    int foo = 0;
    

    Should the output of the lexer be: keyword letter letter letter equals digit semicolon? And does the parser then reduce the letter letter letter to an identifier?

    Solution

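    In general, your lexer should produce a stream of structs that contain language elements: operators, identifiers, keywords, comments, etc. Each struct should be marked with the type of the lexeme and carry content relevant to the type of lexeme it represents. For the example in the question, that means the lexer itself recognizes `foo` as a single identifier lexeme; the parser never sees individual letters.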

    To enable good error reporting, it helps if each lexeme carries its starting line and column, its ending line and column (some lexemes span multiple lines), and the originating source file (sometimes a parser has to handle included files as well as the main file).

    For those language elements that contain variable content (numbers, identifiers, etc.), the struct should contain the variable content.
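As a rough illustration of such a struct, here is a minimal Python sketch. The class name `Lexeme` and all of its field names are assumptions made for this example; they are not taken from DMS or any particular lexer.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lexeme:
    """One lexer output record (all field names are illustrative)."""
    kind: str                     # lexeme type, e.g. "KEYWORD", "IDENTIFIER"
    text: str                     # exact source text of the lexeme
    value: Union[str, int, None]  # variable content; None for fixed tokens
    file: str                     # originating source file
    line: int                     # starting line (1-based)
    col: int                      # starting column (1-based)
    end_line: int                 # ending line
    end_col: int                  # ending column (exclusive)


# What a lexer might emit for `int foo = 0;` on line 1 of test.c,
# with whitespace already discarded.
lexemes = [
    Lexeme("KEYWORD", "int", None, "test.c", 1, 1, 1, 4),
    Lexeme("IDENTIFIER", "foo", "foo", "test.c", 1, 5, 1, 8),
    Lexeme("ASSIGN", "=", None, "test.c", 1, 9, 1, 10),
    Lexeme("INT_LITERAL", "0", 0, "test.c", 1, 11, 1, 12),
    Lexeme("SEMICOLON", ";", None, "test.c", 1, 12, 1, 13),
]

for lx in lexemes:
    print(f"{lx.file}:{lx.line}:{lx.col} {lx.kind} {lx.value!r}")
```

Only the identifier and the integer literal carry a value; the fixed tokens are fully described by their kind.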

    For compiling or program analysis, the lexer can throw whitespace and comments away. If you intend to parse/modify the code, you'll need to capture comments.
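The choice between discarding and keeping whitespace and comments can be sketched with a small regex-based lexer. This is only a toy under stated assumptions: the token kinds, the `KEYWORDS` subset, and the `keep_trivia` flag are invented for the example and cover just enough of C to lex the sample input.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int
    col: int


KEYWORDS = {"int", "return", "if", "else", "while"}  # tiny illustrative subset

# Order matters: comments must be tried before the single-character operators.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/|//[^\n]*)
  | (?P<WHITESPACE>\s+)
  | (?P<IDENTIFIER>[A-Za-z_]\w*)
  | (?P<INT_LITERAL>\d+)
  | (?P<PUNCT>[=;+\-*/(){}])
""", re.VERBOSE | re.DOTALL)


def lex(source: str, keep_trivia: bool = False) -> Iterator[Token]:
    """Yield tokens; whitespace and comments are dropped unless keep_trivia."""
    line, col, pos = 1, 1, 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected {source[pos]!r} at {line}:{col}")
        kind, text = m.lastgroup, m.group()
        if kind == "IDENTIFIER" and text in KEYWORDS:
            kind = "KEYWORD"
        if keep_trivia or kind not in ("COMMENT", "WHITESPACE"):
            yield Token(kind, text, line, col)
        # Advance the position, accounting for lexemes that span lines.
        newlines = text.count("\n")
        if newlines:
            line += newlines
            col = len(text) - text.rfind("\n")
        else:
            col += len(text)
        pos = m.end()


source = "/* My test file */\n\nint foo\n    = 0; // a declaration\n"
for tok in lex(source):
    print(tok)
```

Calling `lex(source, keep_trivia=True)` yields the comments and whitespace as tokens too, which is the variant a source-to-source tool would start from.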

    An example output can be instructive. For a variant of OP's example:

    /* My test file */
    
    int foo
        = 0; // a declaration
    

    ... DMS's C front end produces the following lexemes (this is a debug output, really handy to have when designing a complex lexer):

    C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>run ../domainlexer C:\temp\test.c
    Lexer Stream Display 1.5.1
    Using encoding Unicode-UTF-8?ANSI +CRLF +1 /^I
    !! Lexer:ResetLexicalModeStack
    !! after Lexer:PushLexicalMode:
    Lexical Mode Stack:
    1 C
    File "C:/temp/test.c", line 1: /* My test file */
    File "C:/temp/test.c", line 2:
    File "C:/temp/test.c", line 3: int foo
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 3 Col 1 ELine 3 ECol 4 Token 23: 'int' [VOID]=0000
      <<< PreComments:
    Comment 1 Type 1 Line 1 Column 1 `/* My test file */'
    !! Lexeme @ Line 3 Col 4 ELine 3 ECol 5 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 3 Col 5 ELine 3 ECol 8 Token 210: IDENTIFIER [STRING]=`foo'
    File "C:/temp/test.c", line 4:     = 0; // a declaration
    !! Lexer:GotoLexicalMode 1 C
    !! Lexeme @ Line 3 Col 8 ELine 4 ECol 5 Token 2: whitespace [VOID]=0000
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 4 Col 5 ELine 4 ECol 6 Token 113: '=' [VOID]=0000
    !! Lexeme @ Line 4 Col 6 ELine 4 ECol 7 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 4 Col 7 ELine 4 ECol 8 Token 138: INT_LITERAL [NATURAL]=0
    File "C:/temp/test.c", line 5:
    !! Lexeme @ Line 4 Col 8 ELine 4 ECol 9 Token 98: ';' [VOID]=0000
      >>> PostComments:
    Comment 1 Type 2 Line 4 Column 10 `// a declaration'
    File "C:/temp/test.c", line 5:
    File "C:/temp/test.c", line 6:
    File "C:/temp/test.c", line 7:
    !! Lexer:GotoLexicalMode 1 C
    !! Lexeme @ Line 4 Col 26 ELine 7 ECol 1 Token 2: whitespace [VOID]=0000
    !! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 4: end_of_input_stream [VOID]=0000
    !! Lexer:GotoLexicalMode 2 CMain
    !! Lexeme @ Line 7 Col 1 ELine 7 ECol 1 Token 0: EndOfFile
    11 lexemes processed.
    0 lexical errors detected.
    
    C:\DMS\Domains\C\GCC4\Tools\Lexer\Source>
    

    The main output is the lines marked !!, each of which represents the contents of a lexeme struct produced by the lexer. Each lexeme carries:

    • source file location information (for the main file, "test.c" in this case, that is not printed to make the debug output a bit more readable)
    • a "token number" (lexeme type) and the human-readable token name (makes debugging a lot easier)
    • the type of value carried by the token: [VOID] means "none", [STRING] means the token carries a string value, [NATURAL] means it carries an integral value, etc.
    • precomments: Comments preceding the token. This is unusual for classic lexers, but necessary if one is trying to transform source code. You can't lose the comments! Note the precomment is attached to a token; because comments are not semantically meaningful, one can argue where they should be placed. This is our particular choice.
    • postcomments: Comments that follow the token and belong to it.
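One way to attach comments to tokens is sketched below. The attachment policy is an assumption chosen for the example (a comment on the same line as the previous token becomes its postcomment; anything else becomes a precomment of the next token) and is not necessarily the rule DMS uses. The names `Lexeme` and `attach_comments` are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Lexeme:
    kind: str
    text: str
    line: int
    precomments: List[str] = field(default_factory=list)
    postcomments: List[str] = field(default_factory=list)


def attach_comments(raw: List[Tuple[str, str, int]]) -> List[Lexeme]:
    """Fold COMMENT entries of (kind, text, line) into neighboring lexemes.

    Assumed policy: a comment that starts on the same line as the previous
    token becomes that token's postcomment; any other comment becomes a
    precomment of the next token.
    """
    out: List[Lexeme] = []
    pending: List[str] = []
    for kind, text, line in raw:
        if kind == "COMMENT":
            if out and out[-1].line == line:
                out[-1].postcomments.append(text)
            else:
                pending.append(text)
        else:
            out.append(Lexeme(kind, text, line, precomments=pending))
            pending = []
    if pending and out:
        # Trailing comments with no following token stay on the last token.
        # (A fuller lexer could hang them on an end-of-file token instead.)
        out[-1].postcomments.extend(pending)
    return out


raw = [
    ("COMMENT", "/* My test file */", 1),
    ("KEYWORD", "int", 3),
    ("IDENTIFIER", "foo", 3),
    ("PUNCT", "=", 4),
    ("INT_LITERAL", "0", 4),
    ("PUNCT", ";", 4),
    ("COMMENT", "// a declaration", 4),
]
lexemes = attach_comments(raw)
for lx in lexemes:
    print(lx.kind, repr(lx.text), lx.precomments, lx.postcomments)
```

On this input the sketch reproduces the placement seen in the trace: the block comment lands on `int` as a precomment and the line comment lands on `;` as a postcomment.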

    The last "token", EndOfFile, is implicitly defined in every DMS lexer.

    This debug trace also notes transitions of the lexer across lexical modes (many lexer generators have multiple modes in which they lex various parts of a language). It shows source lines as they are read.
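For readers unfamiliar with lexical modes, the idea can be sketched generically as a stack of rule sets. Nothing here reflects how DMS implements its `C` and `CMain` modes; the mode names, rule tables, and the `push:`/`pop` action strings are all invented for the example, which uses a separate mode for the inside of a string literal.

```python
import re
from typing import List, Tuple

# Each mode has its own rules: (regex, token kind or None to skip, mode action).
# Mode actions: None (stay), "push:<mode>", or "pop". All names are illustrative.
MODES = {
    "main": [
        (re.compile(r"\s+"), None, None),
        (re.compile(r'"'), None, "push:string"),
        (re.compile(r"[A-Za-z_]\w*"), "IDENTIFIER", None),
        (re.compile(r"[=;]"), "PUNCT", None),
    ],
    "string": [
        (re.compile(r'"'), None, "pop"),
        (re.compile(r"\\."), "ESCAPE", None),
        (re.compile(r'[^"\\]+'), "STRING_TEXT", None),
    ],
}


def lex_with_modes(source: str) -> List[Tuple[str, str, str]]:
    """Return (mode, kind, text) triples, switching rule sets as modes change."""
    stack = ["main"]
    pos = 0
    out = []
    while pos < len(source):
        for pattern, kind, action in MODES[stack[-1]]:
            m = pattern.match(source, pos)
            if m:
                break
        else:
            raise SyntaxError(f"no rule in mode {stack[-1]!r} at offset {pos}")
        if kind is not None:
            out.append((stack[-1], kind, m.group()))
        if action == "pop":
            stack.pop()
        elif action is not None:
            stack.append(action.split(":", 1)[1])
        pos = m.end()
    return out


# The source text is: s = "a\nb";  (backslash and n are two literal characters)
tokens = lex_with_modes('s = "a\\nb";')
for tok in tokens:
    print(tok)
```

The same characters are tokenized differently depending on the active mode, which is why a debug trace that logs mode transitions is useful when a lexer misbehaves.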
