Regex lookahead ordering

Question

I'm pretty decent with regular expressions, and now I'm trying once again to understand lookahead and lookbehind assertions. They mostly make sense, but I'm not quite sure how the order affects the result. I've been looking at this site which places lookbehinds before the expression, and lookaheads after the expression. My question is, does this change anything? A recent answer here on SO placed the lookahead before the expression which is leading to my confusion.

Recommended answer

When tutorials introduce lookarounds, they tend to choose the simplest use case for each one. So they'll use examples like (?<!a)b ('b' not preceded by 'a') or q(?=u) ('q' followed by 'u'). It's just to avoid cluttering the explanation with distracting details, but it tends to create (or reinforce) the impression that lookbehinds and lookaheads are supposed to appear in a certain order. It took me quite a while to get over that idea, and I've seen several others afflicted with it, too.
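
As a minimal sketch of that point, the two tutorial patterns can be compared with rewritten versions that move the lookaround to the other side. The rewritten patterns and the sample strings are assumptions made up for illustration, and Python's re module is used only as a convenient engine:

```python
import re

# Tutorial style: lookbehind before the expression, lookahead after it.
b_not_after_a = re.compile(r"(?<!a)b")   # 'b' not preceded by 'a'
q_before_u = re.compile(r"q(?=u)")       # 'q' followed by 'u'

# Same conditions with the lookaround on the other side (illustrative rewrites).
b_not_after_a_alt = re.compile(r"b(?<!ab)")  # match 'b', then look back over "ab"
q_before_u_alt = re.compile(r"(?=qu)q")      # look ahead for "qu", then match 'q'

def starts(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in pattern.finditer(text)]

b_text = "ab cb b"
q_text = "quit qat"

b_starts = starts(b_not_after_a, b_text)
b_starts_alt = starts(b_not_after_a_alt, b_text)
q_starts = starts(q_before_u, q_text)
q_starts_alt = starts(q_before_u_alt, q_text)

print(b_starts, b_starts_alt)  # same offsets from both spellings
print(q_starts, q_starts_alt)
```

Both spellings of each condition report the same match positions on these inputs; only the point at which the assertion is checked differs.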

Try looking at some more realistic examples. One question that comes up a lot involves validating passwords; for example, making sure a new password is at least six characters long and contains at least one letter and one digit. One way to do that would be:

^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$

On its own, [A-Za-z0-9]{6,} would also accept a string of all letters or all digits, so you use the lookaheads to ensure that there's at least one of each. In this case the lookaheads have to come first: anchored at the start of the string, each one can scan the whole input, and because lookaheads consume no characters, [A-Za-z0-9]{6,}$ still starts matching at the first character.
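
A small sketch of how this pattern could be used, assuming Python's re module; the helper name is_valid_password is hypothetical and not part of the original answer:

```python
import re

# The pattern from the answer: two lookaheads anchored at the start,
# followed by the part that actually consumes the string.
PASSWORD_RE = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$")

def is_valid_password(candidate: str) -> bool:
    """Hypothetical helper: True if candidate has at least six letters/digits
    and contains at least one letter and at least one digit."""
    return PASSWORD_RE.match(candidate) is not None

for candidate in ["abc123", "abcdef", "123456", "ab1", "abc12!"]:
    print(candidate, is_valid_password(candidate))
```

Only the first candidate passes: the next two fail a lookahead, the fourth is too short, and the last contains a character outside the class. Note that in Python, $ also matches just before a trailing newline, so \Z may be a better fit if that matters.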

For another example, suppose you need to find all occurrences of the word "there" unless it's preceded by a quotation mark. The obvious regex for that is (?<!")[Tt]here\b, but if you're searching a large corpus, that could create a performance problem. As written, that regex will do the negative lookbehind at each and every position in the text, and only when that succeeds will it check the rest of the regex.

Every regex engine has its own strengths and weaknesses, but one thing that's true of practically all of them is that they're quicker to find fixed sequences of literal characters than anything else; the longer the sequence, the better. That means it can be dramatically faster to do the lookbehind last, even though it means matching the word twice:

[Tt]here\b(?<!"[Tt]here)
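
A quick way to check that the two orderings agree, sketched here in Python with a made-up sample sentence. It only compares the matches and does not measure speed:

```python
import re

# Lookbehind first: the assertion is tried at every position in the text.
lookbehind_first = re.compile(r'(?<!")[Tt]here\b')
# Lookbehind last: the word is matched first, then the assertion looks back over it.
lookbehind_last = re.compile(r'[Tt]here\b(?<!"[Tt]here)')

text = 'There it is. He said "there" and pointed over there.'  # illustrative sample

first_spans = [m.span() for m in lookbehind_first.finditer(text)]
last_spans = [m.span() for m in lookbehind_last.finditer(text)]
matched_words = [text[start:end] for start, end in first_spans]

print(first_spans == last_spans)
print(matched_words)
```

Both patterns skip the quoted occurrence and find the other two. Whether the second form is actually faster depends on the engine and the data, so it is worth timing both on a representative corpus (for example with timeit) before committing to one.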

So the rule governing the placement of lookarounds is that there is no rule; you put them wherever they make the most sense in each case.
