Regex PHP. Reduce steps: limited by fixed width Lookbehind


Question

I have a regex that is used to match @user tags.

I use lookaround assertions so that only punctuation and whitespace characters may surround a tag.
There is an added complication: a kind of BBCode that represents HTML.
There are two types of these codes, inline (^B bold ^b) and block (^C center ^c).
The inline ones have to be skipped over to reach the previous or next character. The block ones are allowed to surround a tag, just like punctuation.

I made a regex that works. What I want to do now is lower the number of steps it takes at every character that is not going to be part of a match.
At first I thought I could write a regex that just looks for @ and, once it is found, checks the lookarounds. That worked without the inline codes, but because a lookbehind must be fixed width and cannot contain quantifiers, I cannot put ((\^[BIUbiu])++)* inside it, and the pattern ends up taking many more steps.

How can I make my regex more efficient, with fewer steps?

Here is a simplified version of it; the full regex is in the Regex101 link.

(?<=[,\.:=\^ ]|\^[CJLcjl])((\^[BIUbiu])++)*@([A-Za-z0-9\-_]{2,25})((\^[BIUbiu])++)*(?=[,\.:=\^ ]|\^[CJLcjl])

https://regex101.com/r/lTPUOf/4/
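
For illustration, the simplified pattern above can be exercised in PHP roughly as follows. This is only a sketch: the sample text and variable names are made up, and the full pattern behind the Regex101 link may behave differently.

```php
<?php
// Simplified pattern from the question, unchanged apart from the ~ delimiters.
$pattern = '~(?<=[,\.:=\^ ]|\^[CJLcjl])((\^[BIUbiu])++)*@([A-Za-z0-9\-_]{2,25})((\^[BIUbiu])++)*(?=[,\.:=\^ ]|\^[CJLcjl])~';

// Hypothetical input: one tag wrapped in inline codes, one inside a block
// code, and an e-mail address that must not be treated as a tag.
$text = 'Hi ^B@alice^b, ping ^C@bob^c and mail me at foo@example.com ';

preg_match_all($pattern, $text, $matches);

// Group 3 holds the user name without the surrounding inline codes.
$users = $matches[3];
echo implode(', ', $users), PHP_EOL; // alice, bob
```

The e-mail address is rejected because the character before its @ is neither one of the boundary characters nor a block code.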

Recommended answer

Rule of thumb:

Do not let the engine attempt a match at every single character if there are some boundaries.

The quote originally comes from this answer. The following regular expression reduces the number of steps significantly, from ~20000 to ~900, thanks to the left side of the outermost alternation:

(?:[^@^]++|[@^]{2,}+)(*SKIP)(*F)
|
(?<=([HUGE-CHARACTER-CLASS])|\^[cjleqrd])
    (\^[34biu78])*+@([a-z\d][\w.-]{0,25}[a-z\d])(\^[34biu78])*+(?=(?1))

Actually, I don't care much about the number of steps reported by regex101, because it will not be the same in your own environment, and it is not obvious whether some steps are real or which steps are left out. But in this case, since the logic of the regex is clear and the difference is large, the comparison makes sense.

What is the logic?

We first try to match what is probably not wanted at all, throw it away, and then look for parts that may match our pattern. [^@^]++ matches up to the next @ or ^ symbol (the characters we care about), and [@^]{2,}+ prevents the engine from taking extra steps before finding out it is going nowhere. (*SKIP)(*F) then forces a failure and makes the next attempt start where the consumed text ended, rather than one character further on. So we make it fail as soon as possible.
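
A minimal PHP sketch of this idea, under stated assumptions: the answer leaves its character class elided as [HUGE-CHARACTER-CLASS], so the small class from the question's simplified regex stands in for it here, and the helper `usernames()`, the variable names and the sample text are invented for illustration. It runs the tag pattern with and without the skipping branch and collects the user names from both.

```php
<?php
// Assumption: stand-in for the answer's elided [HUGE-CHARACTER-CLASS].
$boundary = '[,.:=^ ]';

// Tag pattern from the answer. The hyphen is written last in [\w.-]
// because some PCRE2 versions reject a hyphen placed directly after \w.
$core = '(?<=(' . $boundary . ')|\^[cjleqrd])'
      . '(\^[34biu78])*+@([a-z\d][\w.-]{0,25}[a-z\d])(\^[34biu78])*+(?=(?1))';

// Left branch: consume text that cannot start a tag, then discard it.
$skip = '(?:[^@^]++|[@^]{2,}+)(*SKIP)(*F)';

$plain = '~' . $core . '~i';
$fast  = '~' . $skip . '|' . $core . '~i';

// Hypothetical helper: returns capture group 3 (the user name) of every match.
function usernames(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $m);
    return $m[3];
}

$text = 'Hi ^B@alice^b, ping ^C@bob^c and mail me at foo@example.com ';

$plainUsers = usernames($plain, $text);
$fastUsers  = usernames($fast, $text);

echo implode(', ', $plainUsers), PHP_EOL; // alice, bob
echo implode(', ', $fastUsers), PHP_EOL;  // alice, bob
```

Both patterns agree on this sample. The left branch also swallows runs of two or more @ or ^ characters, so an input such as ` ^@name ` is matched by the plain pattern but skipped by the faster one; it is worth checking the two against your own data before swapping them.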

You can use the i flag instead of spelling out the uppercase forms of the letters (this may have a small impact, however).
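
As a small sketch of that trade, using only the inline-code part of the question's pattern (the sample text and variable names are made up):

```php
<?php
// Inline codes with both letter cases spelled out, as in the question.
$explicit = '~\^[BIUbiu]~';
// The same class with lowercase letters only, relying on the i flag.
$flagged  = '~\^[biu]~i';

$text = '^B bold ^b ^I italic ^x';

// preg_match_all returns the number of full matches.
$explicitCount = preg_match_all($explicit, $text);
$flaggedCount  = preg_match_all($flagged, $text);

echo $explicitCount, ' ', $flaggedCount, PHP_EOL; // 3 3
```

Note that the flag applies to the whole pattern, so every other letter class in it becomes case-insensitive as well.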

See the live demo here.
