My lex pattern doesn't work to match my input file, how to correct it?

Problem description

I have a simple pattern to match: head+content+tail. My lex file looks like this:

$ cat b.l
%{
#include<stdio.h>
%}
%%
"12" {printf("head\n");}
"34" {printf("tail\n");}
.* {printf("content\n");}
%%

I expect it to print "head" when it meets "12", "tail" when it meets "34", and "content" for any other contiguous string.

So I compiled and ran it:

$ lex b.l && gcc lex.yy.c -ll
$ echo '12sdaesre34' | ./a.out
content

My expectation was that it would print:

head
content
tail

But it actually prints only the "content" line. What did I get wrong, and how can I correct it?

Thanks!

Recommended answer

(F)lex always matches the longest possible token. Since .* matches any sequence that doesn't contain a newline character, it will happily match the whole of 12sdaesre34. (In (f)lex, . matches any character other than newline.) That match is longer than the two characters matched by "12", so the .* rule wins at the very start of the input, and the 34 is consumed along with everything else and is no longer available to be matched.
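The selection rule can be illustrated with a small hand-written C sketch. This is not how flex is implemented (flex compiles the patterns into a single automaton); it only mimics the outcome for the three rules in b.l. The names pick_rule, match_literal, match_dot_star and the RULE_* constants are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule indices, in the same order as the rules in b.l. */
enum { RULE_NONE = -1, RULE_HEAD = 0, RULE_TAIL = 1, RULE_CONTENT = 2 };

/* Length matched by a literal pattern at the start of s, or 0. */
static size_t match_literal(const char *s, const char *lit)
{
    size_t n = strlen(lit);
    return strncmp(s, lit, n) == 0 ? n : 0;
}

/* Length matched by ".*" at the start of s: everything before a newline. */
static size_t match_dot_star(const char *s)
{
    return strcspn(s, "\n");
}

/* Pick the rule a (f)lex scanner would choose at the start of s:
 * the longest match wins, and on a tie the earlier rule wins.
 * Simplification: a zero-length match is reported as RULE_NONE. */
int pick_rule(const char *s, size_t *len_out)
{
    assert(s != NULL && len_out != NULL);

    size_t lens[3];
    lens[RULE_HEAD] = match_literal(s, "12");
    lens[RULE_TAIL] = match_literal(s, "34");
    lens[RULE_CONTENT] = match_dot_star(s);

    int best = RULE_NONE;
    size_t best_len = 0;
    for (int i = 0; i < 3; i++) {
        /* Strict '>' keeps the earlier rule when lengths are equal. */
        if (lens[i] > best_len) {
            best = i;
            best_len = lens[i];
        }
    }
    *len_out = best_len;
    return best;
}
```

For the input 12sdaesre34, the head rule offers 2 characters and the .* rule offers all 11, so pick_rule selects the content rule. For an input line consisting only of 12, both offer 2 characters and the head rule wins because it comes first.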

To fix it, you have to be clear about what you want content to match. For example, the following will match any non-empty sequence which doesn't contain a digit:

[^[:digit:]]+   { printf("content\n"); }

You might want to add newline to the list of characters that should not be matched:

[^\n[:digit:]]+   { printf("content\n"); }

Or perhaps you want to match the longest sequence not containing 34. That's trickier, but it can be done:

([^3]|3+[^34])+   { printf("content\n"); }

However, that pattern will still match the initial 12, so on its own it is not enough to solve the problem.

If your input always consists of strings of the form 12...34, possibly interspersed with other content, you can match the entire 12...34 sequence and split it into three tokens. That's undoubtedly the simplest solution, since the beginning and end markers are of a known length. The first of the following patterns matches a string which doesn't start with 12 and which ends just before the first instance of 12. The second matches a string which starts with 12 and ends at the first instance of 34 (which is included in the match). Neither pattern will match an input which contains an unmatched 12, so a third rule is added to handle that case; it looks a lot like the second rule but doesn't include the match for 34 at the end. Because (f)lex always matches the longest possible token, and prefers the earlier rule when two matches have the same length, the third rule will only succeed if the second rule fails.

([^1]|1+[^12])*         { puts("content"); }
12([^3]|3+[^34])*34     { puts("head content tail"); }
12([^3]|3+[^34])*       { puts("error"); }
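The decision made by the second and third rules can also be sketched in plain C. The sketch below uses strstr rather than a regular expression, so it follows the behaviour described in the text (leading 12, then everything up to the first 34) and is not a character-for-character transcription of the patterns. The names split_delimited and struct span are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: a view into the input, not a copy. */
struct span {
    const char *start;
    size_t len;
};

/* If s starts with "12" and a later "34" closes it, store the text
 * between the two delimiters in *content and return 1.
 * Return 0 if s does not start with "12" or the "12" is never closed
 * (the case the "error" rule is meant to catch). */
int split_delimited(const char *s, struct span *content)
{
    assert(s != NULL && content != NULL);

    if (strncmp(s, "12", 2) != 0)
        return 0;

    /* First "34" after the opening delimiter. */
    const char *end = strstr(s + 2, "34");
    if (end == NULL)
        return 0;

    content->start = s + 2;
    content->len = (size_t)(end - (s + 2));
    return 1;
}
```

Called on 12sdaesre34, this reports a 7-character span starting at the s; called on 12sdaesre, it returns 0 because the closing 34 is missing.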

Normally, you would want to actually capture the value of content to pass on to the calling program. In the first rule, that is just yytext, but in the second rule the content consists of yyleng-4 characters starting at yytext+2 (in order to remove the leading and trailing delimiters).

For most purposes, it is necessary to copy the matched token if you need to keep it, because yytext points into an internal data structure used by the lexical scanner and the pointer will be invalidated by the next pattern match. In the case of the first rule, you could create a copy of the string using strdup, but for the second rule, you'd want to make the copy yourself:

([^1]|1+[^12])*         { yylval = strdup(yytext); ... }
12([^3]|3+[^34])*34     { yylval = malloc(yyleng-3);   /* content length + 1 for the NUL */
                          memcpy(yylval, yytext+2, yyleng-4);   /* skip the leading 12 */
                          yylval[yyleng-4] = '\0';
                          ...
                        }

Those assume that yylval is a global variable of type char*, and that somewhere in the code you free() the string saved by the rule. They also assume that you do something with yylval in the omitted code (...), or that you return to the caller with an indication as to whether the head and tail were encountered.
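The copy made for the delimited case can be factored into a small helper, sketched here under the assumption that the delimiters are exactly two characters each. The function name copy_content is hypothetical; its text and len parameters stand in for yytext and yyleng.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated, NUL-terminated copy of the text between a
 * two-character head and a two-character tail, or NULL if malloc fails.
 * text/len stand in for yytext/yyleng. The caller must free() the result. */
char *copy_content(const char *text, size_t len)
{
    assert(text != NULL && len >= 4);   /* at least "12" + "34" */

    size_t n = len - 4;                 /* content length without delimiters */
    char *copy = malloc(n + 1);         /* +1 for the terminating NUL */
    if (copy == NULL)
        return NULL;

    memcpy(copy, text + 2, n);          /* skip the leading delimiter */
    copy[n] = '\0';
    return copy;
}
```

Inside the rule's action this could be used as yylval = copy_content(yytext, yyleng); with the same requirement as before that the saved string is eventually passed to free().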
