How do (*SKIP) or (*F) work on regex?


Question

I'm learning advanced regex usage and noticed that many posts use (*SKIP) or (*F) in their patterns.

I posted a question where the idea was to match lines that don't contain yellow but do contain blue, and only if brown appears after blue. The right answer was:

.*yellow.*(*SKIP)(*F)|^.*\bblue\b(?=.*brown).*$

I also tried lookaround expressions like the one below, but they didn't work for all cases:

^((?!yellow).)*blue(?=.*brown).*$

I had no idea about these (*SKIP)(*F) flags, so the question is: how do these flags work? What do they do? And are there other flags like these?

Thanks.

Recommended answer

These two backtracking control verbs are implemented only in Perl, PCRE and the pypi regex module.

The idea of the (*SKIP)(*FAIL) trick is to consume the characters that you want to avoid, so that they cannot be part of the match result.

A classic pattern that uses this trick looks like this:

What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match

A regex engine processes a string like this:

  • it tests the first token of the pattern on each character, from left to right (this is the default most of the time, but some regex engines can be set to work from right to left; .NET can do this, if I remember correctly)

  • if the first token matches, the regex engine tests the next token of the pattern against the following characters (after the first token's match), and so on

  • when a token fails, the regex engine gives back the characters matched by the last token and tries another way to make the pattern succeed (if that doesn't work either, it does the same with the previous token, and so on)
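The three steps above can be observed with any backtracking engine. Here is a minimal sketch using Python's standard `re` module; the sample string and the variable names are illustrative assumptions, not part of the original answer.

```python
import re

text = "abcbd"

# Start positions are tried from left to right: "b." first fits at index 1.
first = re.search(r"b.", text)
print(first.start(), first.group())   # 1 bc

# ".*" first grabs "bcbd". No "c" can follow, so the engine gives characters
# back one at a time until the final "c" token can match.
backtracked = re.search(r"a.*c", text)
print(backtracked.group())            # abc

# When no amount of backtracking satisfies every token, the search fails.
no_match = re.search(r"a.*c", "abd")
print(no_match)                       # None
```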

When the regex engine meets the (*SKIP) verb (at which point all previous tokens have obviously succeeded), it is no longer allowed to backtrack into the tokens to its left. If the pattern fails later, to the right of the (*SKIP) verb, the engine is also not allowed to retry any of the characters matched so far, up to and including the last matched one, whether with another branch of the pattern or from a later start position inside that stretch of the string.

The role of (*FAIL) is to force the pattern to fail. Thus all the characters matched to the left of (*SKIP) are skipped, and the regex engine resumes its work after these characters.

In the example pattern, the only way for the pattern to succeed is for the first branch to fail before reaching (*SKIP), which allows the second branch to be tested.
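To make this behaviour concrete, here is a rough emulation written with Python's standard `re` module, which does not support the verbs itself. The function name `skip_fail_findall` and the sample text are assumptions made for illustration. It only mimics the simple `avoid(*SKIP)(*FAIL)|want` shape and is not how a real engine implements the verbs.

```python
import re

def skip_fail_findall(avoid, want, text, flags=0):
    """Hypothetical helper emulating avoid(*SKIP)(*FAIL)|want with the stdlib."""
    avoid_re = re.compile(avoid, flags)
    want_re = re.compile(want, flags)
    results = []
    pos = 0
    while pos <= len(text):
        # First branch: if the unwanted part matches here, discard it and
        # resume right after it (the combined effect of (*SKIP) and (*FAIL)).
        skipped = avoid_re.match(text, pos)
        if skipped and skipped.end() > pos:
            pos = skipped.end()
            continue
        # The first branch failed before reaching (*SKIP): try the second one.
        found = want_re.match(text, pos)
        if found:
            results.append(found.group())
            pos = found.end() if found.end() > pos else pos + 1
        else:
            pos += 1
    return results

lines = "blue and brown\nyellow blue brown\nblue only\nsky blue, then brown"
matches = skip_fail_findall(r".*yellow.*",
                            r"^.*\bblue\b(?=.*brown).*$",
                            lines, re.MULTILINE)
print(matches)   # ['blue and brown', 'sky blue, then brown']
```

The two patterns passed in are the two branches of the pattern from the question; the line containing yellow is consumed by the first branch and never reaches the second.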

You can find another explanation here.

Backtracking control verbs are not implemented in other regex engines, and there is no equivalent.

However, there are several ways to do the same thing (more precisely, to avoid something that could otherwise be matched by another part of the pattern).

The use of capture groups:

way 1:

What_I_want_to_avoid|(What_I_want_to_match)

You only need to extract capture group 1 (or test whether it exists), since it is what you are looking for. If you use the pattern to perform a replacement, you can use the properties of the match result (offset, length, capture group) to do the replacement with classic string functions. Other languages, such as JavaScript and Ruby, allow a callback function to be used as the replacement.
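As an illustration of way 1, the sketch below uses Python's standard `re` module. The sample text, the choice of quoted strings as the part to avoid, and the helper name `replace_wanted` are assumptions for the example only.

```python
import re

text = 'a 12 "34" 56'

# Branch 1 consumes what we want to avoid (quoted strings);
# branch 2 captures what we actually want (numbers) in group 1.
pattern = re.compile(r'"[^"]*"|(\d+)')

# Extraction: keep only the matches where group 1 took part.
wanted = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(wanted)     # ['12', '56']

def replace_wanted(match):
    # Callback replacement: leave avoided text untouched, rewrite the rest.
    if match.group(1) is None:
        return match.group(0)
    return "<" + match.group(1) + ">"

replaced = pattern.sub(replace_wanted, text)
print(replaced)   # a <12> "34" <56>
```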

way 2:

((?>To_avoid|Other_things_that_can_be_before_what_i_want)*)(What_I_want)

This is the easier way for replacements: no callback function is needed, and the replacement string only has to begin with \1 (or $1).

The use of lookarounds:

For example, you want to find a word that is not embedded between two other words (let's say S_word and E_word, which are different (see Qtax's comment)):

(The edge cases S_word E_word word E_word and S_word word S_word E_word are allowed in this example.)

The backtracking control verb way would be:

S_word not_S_word_or_E_word E_word(*SKIP)(*F)|word

To do the same with lookarounds, the regex engine needs to allow variable-length lookbehinds to a certain extent. With .NET or the newer regex module this is no problem, since lookbehinds can have a totally variable length. It is possible with Java too, but the size must be bounded (example: (?<=.{1,1000})).

The Java equivalent would be:

word(?:(?!not_S_word_or_E_word E_word)|(?<!S_word not_E_word{0,1000} word))

Note that in some cases only the lookahead is necessary. Note too that starting a pattern with a literal character is more efficient than starting it with a lookbehind, which is why I put the lookbehind after the word (even though the word has to be written once more inside the assertion).
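As a sketch of the lookahead-only case, assume the start and end markers are the single characters `[` and `]`, and that brackets are balanced and never nested; these are assumptions for the example, not part of the original answer. A number is then outside brackets exactly when no `]` follows it before the next `[`:

```python
import re

text = "a 12 [34 x] 56"

# Reject a number if a closing "]" follows it with no opening "[" in between,
# which (under the assumptions above) means the number sits inside brackets.
outside = re.findall(r"\d+(?![^\[\]]*\])", text)
print(outside)   # ['12', '56']
```

Python's standard `re` module only accepts fixed-width lookbehinds, so a lookbehind-based variant along the lines of the Java pattern would not compile there.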
