Lex: identifier vs integer

Question

I'm trying to create my own simple programming language. For this I need to insert some regex into Lex. I'm using the following regex to match identifiers and integers.

[a-zA-Z][a-zA-Z0-9]* /* identifier */ return IDENTIFIER;
("+"|"-")?[0-9]+ /* integer */ return INTEGER;

Now, when I test an illegal identifier such as:

0a = 1;

The leading zero is recognized as an integer followed by the 'a' recognized as an identifier. Instead of this I want this token '0a' to be recognized as an illegal character. How do I include this functionality? What regex do I have to adjust?

Recommended answer

The easiest way to do this in (F)lex is to create a second pattern for the error:

[[:alpha:]][[:alnum:]]*  return IDENTIFIER;
[+-]?[[:digit:]]+        return INTEGER;
[+-]?[[:digit:]]+[[:alpha:]]   {
                           fprintf(stderr,
                                   "Incorrect integer '%s' in line %d\n",
                                   yytext, yylineno);
                           return ERROR;
                         }

The third rule will match any integer with a letter immediately following, and will signal a lexical error. (I'm assuming you've enabled %option yylineno. If not, that will always report the error on line 0.)
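The third rule wins over the second because (F)lex always prefers the longest match: for the input 0a, the error pattern matches two characters while the plain integer pattern matches only one. The following is a minimal hand-written C sketch of that decision, not the code flex generates. The names classify, struct match, and the TOK_* codes are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes for this sketch; a real scanner would take
   them from the parser's generated header. */
enum token { TOK_NONE, TOK_IDENTIFIER, TOK_INTEGER, TOK_ERROR };

/* Result of one match: the token kind and the lexeme length. */
struct match {
    enum token tok;
    size_t len;
};

/* Classify the lexeme at the start of s, mirroring the three rules
   above by always taking the longest match. */
struct match classify(const char *s)
{
    struct match m = { TOK_NONE, 0 };
    size_t i = 0;

    assert(s != NULL);

    /* Rule 1: [[:alpha:]][[:alnum:]]* */
    if (isalpha((unsigned char)s[0])) {
        for (i = 1; isalnum((unsigned char)s[i]); i++)
            ;
        m.tok = TOK_IDENTIFIER;
        m.len = i;
        return m;
    }

    /* Rules 2 and 3 share the prefix [+-]?[[:digit:]]+ */
    if (s[i] == '+' || s[i] == '-')
        i++;
    if (!isdigit((unsigned char)s[i]))
        return m;                    /* none of the three rules match */
    while (isdigit((unsigned char)s[i]))
        i++;

    /* Rule 3 matches one character more than rule 2, so it wins
       whenever a letter immediately follows the digits. */
    if (isalpha((unsigned char)s[i])) {
        m.tok = TOK_ERROR;
        m.len = i + 1;
    } else {
        m.tok = TOK_INTEGER;
        m.len = i;
    }
    return m;
}
```

In this sketch, classify("0a = 1;") reports TOK_ERROR with a length of 2, while classify("42;") reports TOK_INTEGER with a length of 2.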

An alternative might be to continue the lexical scan. In this case, you might want to rescan the offending alphabetic character. The easiest way to do this in Flex is to use its (idiosyncratic) trailing context operator /:

[[:alpha:]][[:alnum:]]*  return IDENTIFIER;
[+-]?[[:digit:]]+        return INTEGER;
[+-]?[[:digit:]]+/[[:alpha:]]   {
                           fprintf(stderr, 
                                   "Warning: Incorrect integer '%s' in line %d\n",
                                   yytext, yylineno);
                           return INTEGER;
                         }

Now the third rule will match exactly the same thing, but after it matches it will back off to the end of the number so that the next lexeme will start with the alphabetic character.

You can also do this with the yyless() macro:

yyless(n) returns all but the first n characters of the current token back to the input stream…

So you could use:

[[:alpha:]][[:alnum:]]*  return IDENTIFIER;
[+-]?[[:digit:]]+        return INTEGER;
[+-]?[[:digit:]]+[[:alpha:]]   {
                           fprintf(stderr, 
                                   "Warning: Incorrect integer '%s' in line %d\n",
                                   yytext, yylineno);
                           yyless(yyleng - 1);
                           return INTEGER;
                         }
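
To make the back-off concrete, here is a small C sketch of what yyless(yyleng - 1) achieves in that rule: the letter is matched as part of the lexeme and then handed back, so the next scan starts at the letter. This is a toy model under stated assumptions, not flex's implementation. struct toy_scanner, toy_yyless, match_bad_integer, and rescan_offset are hypothetical names, and the optional sign is left out for brevity.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the scanner state. */
struct toy_scanner {
    const char *input;
    size_t start;   /* first character of the current lexeme */
    size_t next;    /* where the next lexeme will begin */
};

/* Keep the first n characters of the current lexeme and hand the rest
   back to the input, which is the effect yyless(n) has. */
void toy_yyless(struct toy_scanner *sc, size_t n)
{
    assert(n <= sc->next - sc->start);
    sc->next = sc->start + n;
}

/* Match [[:digit:]]+[[:alpha:]] at the current position and return the
   lexeme length, or 0 if the pattern does not match. */
size_t match_bad_integer(struct toy_scanner *sc)
{
    size_t i = sc->next;

    sc->start = i;
    while (isdigit((unsigned char)sc->input[i]))
        i++;
    if (i == sc->start || !isalpha((unsigned char)sc->input[i]))
        return 0;
    sc->next = i + 1;             /* the lexeme includes the letter */
    return sc->next - sc->start;
}

/* Return the offset at which scanning resumes after the back-off,
   or 0 if the input does not start with a malformed integer. */
size_t rescan_offset(const char *input)
{
    struct toy_scanner sc = { input, 0, 0 };
    size_t len = match_bad_integer(&sc);

    if (len == 0)
        return 0;
    toy_yyless(&sc, len - 1);     /* same idea as yyless(yyleng - 1) */
    return sc.next;
}
```

For the input 12ab, the match covers 12a, and after the back-off scanning resumes at offset 2, the letter a.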

Finally, as @CharlieBurns points out in a comment, you can just let the lexer return two tokens (one number and one identifier) to the parser, which will recognize a syntax error if that sequence is illegal in the language. In many programming languages, no grammatical program can contain an integer immediately followed by an identifier without some punctuation in between.

However, in many other languages, the combination is perfectly reasonable, particularly in languages like Lua where there is no explicit end-of-statement indicator, so

 b = 3 a = 4

is a valid program consisting of two assignment statements. As another example, in Awk string concatenation is represented with no operator and numbers are automatically coerced to strings if necessary, so

print 3 a

will print the concatenation of "3" and the value of a. Lua insists on whitespace in the above example; Awk does not.

And, for the ultimate, C(++) considers 3a to be a single token, a "pre-processing number". If the token actually passes through the preprocessor, an error will be flagged, but the following program has no syntax errors:

#define NOTHING(x)
NOTHING(3a)

As a possibly more interesting example:

#define CONCAT2(a,b) a##b
#define CONCAT(a,b) CONCAT2(a,b)
static const int the_answer = CONCAT(0x, 2a);
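
As a quick check of the snippet above, the following C sketch wraps it in a function. Both 0x and 2a are preprocessing numbers on their own, and ## pastes them into the single token 0x2a, which is an ordinary hexadecimal literal. The function name get_the_answer is hypothetical and only there for illustration.

```c
#include <assert.h>

#define CONCAT2(a,b) a##b
#define CONCAT(a,b) CONCAT2(a,b)

/* 0x and 2a are each a single preprocessing number; ## pastes them
   into 0x2a before the compiler proper ever sees a token. */
static const int the_answer = CONCAT(0x, 2a);

/* Return the pasted constant after checking it is the hex literal 0x2a. */
int get_the_answer(void)
{
    assert(the_answer == 0x2a);
    return the_answer;
}
```

Calling get_the_answer() from main returns 42, the decimal value of 0x2a.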

So there is no "one answer fits all".
