(F)Lex,我该如何匹配否定? [英] (F) Lex, how do I match negation?

查看:109
本文介绍了(F)Lex,我该如何匹配否定?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

某些语言语法在其规则中使用否定词.例如,在Dart规范中,使用以下规则:

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:

~('\'|'"'|'$'|NEWLINE)

这意味着匹配任何不是括号内的规则之一的东西.现在,我知道在flex中我可以否定字符规则(例如:[^ ab],但是我想否定的某些规则可能比单个字符复杂得多,因此我认为我不能为此使用字符规则.例如,我可能需要取消多行字符串的序列""",但我不确定在flex中将采用哪种方式.

Which means match anything that is not one of the rules inside the parenthesis. Now, I know in flex I can negate character rules (ex: [^ab] , but some of the rules I want to negate could be more complicated than a single character so I don't think I could use character rules for that. For example I may need to negate the sequence '"""' for multiline strings but I'm not sure what the way to do it in flex would be.

推荐答案

(TL; DR:跳到底部以获得实际答案.)

(TL;DR: Skip down to the bottom for a practical answer.)

任何常规语言的反义词是常规语言.因此,理论上可以将正则表达式的逆写为正则表达式.不幸的是,这并不总是那么容易.

The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.

至少,"""案例不是太难.

首先,让我们弄清楚我们要匹配的内容.

First, let's be clear about what we are trying to match.

严格地说不是""""将表示除"""以外的任何字符串".但这将包括例如x""".

Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".

因此,很可能会说我们正在寻找任何不包含"""的字符串". (即.*""".*的倒数).但这也不是正确的.典型的用法是将输入标记化:

So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:

"""This string might contain " or ""."""

如果我们从初始"""开始,然后查找不包含"""的最长字符串,则会发现:

If we start after the initial """ and look for the longest string which doesn't contain """, we will find:

This string might contain " or "".""

而我们想要的是:

This string might contain " or "".

因此事实证明,我们需要任何不以"结尾且不包含"""的字符串",这实际上是两个反函数的结合:(~.*" ∧ ~.*""".*)

So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)

(相对)容易生成状态图:

It's (relatively) easy to produce a state diagram for that:

(请注意,上面的状态图与任何不包含"""的字符串"状态图之间的唯一区别是,在该状态图中,所有状态都将被接受,而在该状态下,状态1和2不接受.)

(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)

现在,挑战在于将其转换为正则表达式.有自动的技术可以做到这一点,但是它们产生的正则表达式通常又长又笨拙.不过,这种情况很简单,因为只有一个接受状态,我们只需要描述所有可以在该状态下结束的路径:

Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:

([^"]|\"([^"]|\"[^"]))*

该模型适用于任何简单的字符串,但是当字符串不仅仅是相同字符的序列时,它会稍微复杂一些.例如,假设我们要匹配以END而不是"""终止的字符串.天真的修改上面的模式会导致:

This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:

([^E]|E([^N]|N[^D]))*   <--- DON'T USE THIS

但是该正则表达式将匹配字符串

but that regular expression will match the string

ENENDstuff which shouldn't have been matched

我们正在寻找的真实状态图是

The real state diagram we're looking for is

将其写为正则表达式的一种方式是:

and one way of writing that as a regular expression is:

([^E]|E(E|NE)*([^EN]|N[^ED]))

再次,我通过跟踪最终进入状态0的所有方式来生成该结果:

Again, I produced that by tracing all the ways to end up in state 0:

[^E] stays in state 0
E    in state 1:
     (E|NE)*: stay in state 1
     [^EN]: back to state 0
     N[^ED]:back to state 0 via state 2

这可能是很多工作,包括制作和阅读.结果容易出错. (使用状态图来进行形式验证比较容易,而对于这类问题而言,状态图要小得多,而对于正则表达式来说,状态图可能会变得很大).

This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).

实用的Flex规则集使用开始条件解决此类问题问题.例如,以下是您可能会识别python三引号字符串的方法:

Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:

%x TRIPLEQ
start \"\"\"
end   \"\"\"
%%

{start}        { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }

<TRIPLEQ>.|\n  { /* Append the next token to yytext instead of
                  * replacing yytext with the next token
                  */
                 yymore();
                 /* No return yet, flex continues */
               }
<TRIPLEQ>{end} { /* We've found the end of the string, but
                  * we need to get rid of the terminating """
                  */
                 yylval.str = malloc(yyleng - 2);
                 memcpy(yylval.str, yytext, yyleng - 3);
                 yylval.str[yyleng - 3] = 0;
                 return STRING;
               }

之所以可行,是因为如果"是与{end}匹配的字符串的一部分,则起始条件TRIPLEQ中的.规则将与"不匹配; flex总是选择最长的匹配项.通过使用[^"]+|\"|\n而不是.|\n可以提高效率,因为这将导致更长的匹配时间,从而减少对yymore()的调用;为了清晰起见,我并不是在上面那样写的.

This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.

此模型易于扩展.特别是,如果我们想使用<![CDATA[作为开始,并使用]]>作为终止符,则只需更改定义

This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions

start "<![CDATA["
end   "]]>"

(如果使用上面建议的优化方法,可能还有起始条件中的优化规则.)

(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)

这篇关于(F)Lex,我该如何匹配否定?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆