My lex pattern doesn't work to match my input file, how to correct it?

Question

I have a simple pattern to match: head + content + tail. My lex file looks like this:

$ cat b.l
%{
#include<stdio.h>
%}
%%
"12" {printf("head\n");}
"34" {printf("tail\n");}
.* {printf("content\n");}
%%

I expect it to print "head" when it meets "12", "tail" when it meets "34", and "content" for any other contiguous string.

So I compiled and ran it:

$ lex b.l && gcc lex.yy.c -ll
$ echo '12sdaesre34'|a.out
content

I expected it to print:

head
content
tail

But it actually prints only the "content" line. What did I get wrong, and how can I correct it?

Thanks!

Recommended answer

(F)lex always matches the longest possible token. Since .* will match any sequence which doesn't contain a newline character, it will happily match 12sdaesre34. (In (f)lex, . matches any character other than newline.) Thus the 34 is no longer available to be matched.
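The longest-match rule can be illustrated outside of (f)lex. The following is a minimal C sketch, not how flex is actually implemented: `match_literal`, `match_dot_star` and `pick_rule` are hypothetical helpers that compute how many characters each of the question's three rules would match at the start of the input and then pick the longest, preferring the earlier rule on a tie as (f)lex does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length matched by a literal pattern at the start of s (0 if no match). */
static size_t match_literal(const char *s, const char *lit) {
    size_t n = strlen(lit);
    return strncmp(s, lit, n) == 0 ? n : 0;
}

/* Length matched by ".*": every character up to, but excluding, a newline. */
static size_t match_dot_star(const char *s) {
    return strcspn(s, "\n");
}

/* Index of the winning rule among the question's three rules:
   0 = "12", 1 = "34", 2 = ".*".
   Returns -1 if no rule matches at least one character. */
int pick_rule(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    size_t lens[3];
    lens[0] = match_literal(s, "12");
    lens[1] = match_literal(s, "34");
    lens[2] = match_dot_star(s);
    int best = -1;
    *len = 0;
    for (int i = 0; i < 3; i++) {
        if (lens[i] > *len) {   /* strict ">" keeps the earlier rule on ties */
            best = i;
            *len = lens[i];
        }
    }
    return best;
}
```

For the input `12sdaesre34`, the `.*` rule matches all 11 characters while `"12"` matches only 2, so the `.*` rule wins and the other two rules never get a chance.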

To fix it, you have to be clear about what you want content to match. For example, the following will match anything which doesn't contain a digit:

[^[:digit:]]+   { printf("content\n"); }

You might want to add newline to the list of characters to not match:

[^\n[:digit:]]+   { printf("content\n"); }

Or perhaps you want to match the longest sequence not containing 34. That's trickier but it can be done:

([^3]|3+[^34])+   { printf("content\n"); }

However, that will still match the initial 12, so it won't be enough to solve the problem on its own.
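To see what that pattern consumes, here is a small C sketch that walks the input the same way the regular expression `([^3]|3+[^34])+` does. The function name `content_prefix_len` is made up for this illustration; it is a hand-written equivalent of the pattern, not code generated by (f)lex.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the prefix of s that ([^3]|3+[^34])+ would match.
   The match stops before the first "34" and never ends inside
   a run of 3s, because 3+ must be followed by [^34]. */
size_t content_prefix_len(const char *s) {
    assert(s != NULL);
    size_t matched = 0;   /* end of the last complete alternative */
    size_t i = 0;
    while (s[i] != '\0') {
        if (s[i] != '3') {
            i++;                          /* alternative [^3] */
        } else {
            size_t j = i;
            while (s[j] == '3') j++;      /* 3+ */
            if (s[j] == '\0' || s[j] == '4')
                break;                    /* no [^34] follows the run */
            i = j + 1;                    /* alternative 3+[^34] */
        }
        matched = i;
    }
    return matched;
}
```

Calling it on `12sdaesre34` returns 9: the match covers `12sdaesre`, leading `12` included, and stops just before `34`.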

If your input always consists of strings of the form 12...34, possibly interspersed with other content, you can match the entire 12...34 sequence and split it into three tokens. That's undoubtedly the simplest solution, since the beginning and end markers are of a known length. The first of the following patterns matches a string which doesn't start with 12 and ends just before the first instance of 12. The second one matches a string which starts with 12 and ends at the first instance of 34 (which is included in the match). Neither pattern will match an input which contains an unmatched 12, so a third rule is added to handle that case. It looks a lot like the second rule but doesn't include the match for 34 at the end. Because (f)lex always matches the longest possible token, the third rule will only succeed if the second rule fails.

([^1]|1+[^12])*         { puts("content"); }
12([^3]|3+[^34])*34     { puts("head content tail"); }
12([^3]|3+[^34])*       { puts("error"); }
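For comparison, the same three-way split can be written by hand in C. This is only a sketch of the idea using `strstr`, not a translation of the generated scanner: `next_token` and the `token_kind` names are hypothetical, and the error case simply consumes the rest of the input, which is a simplification of what the third rule does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_END, TOK_CONTENT, TOK_DELIMITED, TOK_ERROR };

/* Classifies the token at the start of s and stores its length in *len. */
enum token_kind next_token(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    if (*s == '\0') {
        *len = 0;
        return TOK_END;
    }
    if (strncmp(s, "12", 2) == 0) {
        const char *tail = strstr(s + 2, "34");   /* first "34" after the head */
        if (tail != NULL) {
            *len = (size_t)(tail - s) + 2;        /* include the tail itself */
            return TOK_DELIMITED;
        }
        *len = strlen(s);                         /* unmatched "12" */
        return TOK_ERROR;
    }
    const char *head = strstr(s, "12");           /* content runs up to the next head */
    *len = head != NULL ? (size_t)(head - s) : strlen(s);
    return TOK_CONTENT;
}
```

A caller would loop, advancing the input pointer by `*len` after each call until `TOK_END` is returned.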

Normally, you would want to actually capture the value of content to pass on to the calling program. In the first rule, that is just yytext, but in the second rule the content consists of yyleng-4 characters starting at yytext+2 (in order to remove the leading and trailing delimiters).

For most purposes, it is necessary to copy the matched token if you need to keep it, because yytext points into an internal data structure used by the lexical scanner and the pointer will be invalidated by the next pattern match. In the case of the first rule, you could create a copy of the string using strdup, but for the second rule, you'd want to make the copy yourself:

([^1]|1+[^12])*         { yylval = strdup(yytext); ... }
12([^3]|3+[^34])*34     { /* yyleng-4 content characters plus a terminating NUL */
                          yylval = malloc(yyleng-3);
                          /* skip the two-character head */
                          memcpy(yylval, yytext+2, yyleng-4);
                          yylval[yyleng-4] = '\0';
                          ...
                        }

Those assume that yylval is a global variable of type char*, and that somewhere in the code you free() the string saved by the rule. They also assume that you do something with yylval in the omitted code (...), or that you return to the caller with an indication as to whether the head and tail were encountered.
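The copy for the second rule can also be pulled out into a plain C function, which makes it easy to test outside the scanner. This is a minimal sketch under the same assumptions as above (two-character head and tail); `copy_content` is a hypothetical helper name, not part of (f)lex.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'd copy of the text between the 2-character head and
   the 2-character tail of a matched token. The caller must free() it.
   Returns NULL if the allocation fails. */
char *copy_content(const char *text, size_t len) {
    assert(text != NULL && len >= 4);   /* head and tail are always present */
    size_t n = len - 4;                 /* number of content characters */
    char *copy = malloc(n + 1);         /* +1 for the terminating NUL */
    if (copy == NULL)
        return NULL;
    memcpy(copy, text + 2, n);          /* skip the leading head */
    copy[n] = '\0';
    return copy;
}
```

In the second rule's action it would be called as `yylval = copy_content(yytext, yyleng);`, with the matching `free()` left to whoever consumes `yylval`.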
