Could I specify pattern match priority in lex code?

Problem description

I have a related thread on this site (My lex pattern doesn't work to match my input file, how to correct it?).

The problem I ran into is about how "greedily" lex matches patterns. For example, here is my lex file:

$ cat b.l
%{
#include<stdio.h>
%}
%%
"12" {printf("head\n");}
"34" {printf("tail\n");}
.* {printf("content\n");}
%%

What I want is: when the scanner meets "12", print "head"; when it meets "34", print "tail"; otherwise print "content" for the longest match that contains neither "12" nor "34".

In practice, however, ".*" is a greedy match, so whatever I enter, it prints "content".

What I need is that, given

12sdf2dfsd3sd34

as input, the output is

head
content
tail

So there seem to be two possible approaches:

1. Give ".*" a lower match priority, so that it applies only when neither "12" nor "34" matches. Does lex support such a "priority"?

2. Change the third expression so that it matches any contiguous string that does not contain the substring "12" or "34". But how do I write that regular expression?

Recommended answer

  1. Does (f)lex support priority?

(F)lex always produces the longest possible match. If more than one rule matches that same longest string, the rule that appears first in the file is chosen, so in that one case it does support priority. It does not, however, let a rule with a shorter match take priority, nor does it implement non-greedy matching.

  2. How do you match a string that does not contain one or more sequences?

With some work you can write a regular expression that matches only strings not containing the specified substrings, but it is not particularly easy, and (f)lex provides no simple syntax for such expressions.

A simpler (but slightly less efficient) solution is to match the string in pieces. As a rough outline, you could do the following:

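"12"     { return HEAD; }
"34"     { if (yyleng > 2) {
             /* Text accumulated by yymore() precedes the delimiter:
                push "34" back and report the text before it. */
             yyless(yyleng - 2);
             return CONTENT;
           }
           else
             return TAIL;  /* nothing accumulated: the delimiter itself */
         }
.|\n     { yymore(); /* keep this character and extend the token */ }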

This can be made more efficient by matching several characters at once wherever there is no chance of skipping over a delimiter. Since both delimiters start with 1 or 3, change the last rule to:

.|[^13]+ { yymore(); }

yymore() causes the current token to be retained, so that the next match is appended to it instead of starting a new token. yyless(x) returns all but the first x characters of the token to the input stream; here it is used so that the end delimiter 34 is rescanned after the CONTENT token has been identified.

(That assumes you actually want to tokenize the input stream instead of just printing a debugging message, which is why I called it an outline solution.)
