为什么这三个正则表达式的步数不同? [英] Why do these three regexes have different step counts?

查看:37
本文介绍了为什么这三个正则表达式的步数不同?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这次你信了吗?是的.不,让我们看看.

PCRE 启动优化

PCRE 善待其用户,它提供了一些功能来加速称为启动优化的事情.它根据使用的正则表达式进行一些相关优化.

这些优化的一个重要功能是对主题字符串进行预扫描,以确保:

  • 主题字符串至少包含一个对应于匹配或第一个字符的字符
  • 如果存在已知的起点.

如果没有找到匹配的函数,则永远不会运行.

也就是说,如果我们的正则表达式是 /x/ 并且我们的主题字符串是 abc 然后启用启动优化,则打算进行预扫描寻找x,如果没有找到,整个匹配失败或者更好,它甚至不费心去经历匹配过程.

那么这些信息有何帮助?

让我们回到我们的第一个例子并稍微改变我们的正则表达式.来自:

/^x/gm

/^x/g

区别在于 m 标志未设置.对于那些不知道如果设置了 m 标志会做什么的人:

它改变了 ^$ 符号的含义,因为它们不再表示字符串的开始和结束,而是表示行的开始和结束.

现在,如果我们在主题字符串 abc 上运行这个正则表达式 /^x/g 会怎样?我们是否应该期望引擎采取的步骤有所不同?绝对是的.让我们看看

这次我真的鼓励你相信它.是真实的.

发生了什么?

好吧,这似乎有点令人困惑,但我们将对此有所启发.当没有 m 修饰符集时,预扫描看起来断言字符串的开始(一个已知的起点).如果断言通过,则运行实际匹配函数,否则将返回不匹配".

但是等等...每个主题字符串肯定有一个并且只有一个字符串位置的开始,并且它总是在它的最开始.那么显然不需要预扫描吗?是的,引擎在此处不进行预扫描.使用 /^x/g 它立即断言字符串的开始,然后像这样失败(因为它在 ^ 处匹配,所以它经历了实际的匹配过程).这就是为什么我们看到

调试器显示 6 步,但主页显示 0 步,为什么?

我不确定后者,但为了调试,regex101 调试器使用 (*NO_START_OPT) 运行,因此仅当设置了此谓词时,6 个步骤才成立.我说我不确定后者,因为所有锚定模式都会阻止进一步的预扫描优化,我们应该知道什么可以称为

引擎知道字符串锚点\A(或^,而多行模式禁用时)的开始仅在匹配时出现一次,因此它不会在下一个位置继续.

回到你自己的正则表达式

前两个是锚定的(^ 结合 m 标志),第三个不是.也就是说,第三个正则表达式受益于预扫描优化.您可以相信 35 个步骤,因为它是由优化引起的.但是如果你禁用启动优化:

(*NO_START_OPT)!next(?:(?<=^.{5})|(?=$))

您将看到 57 个步骤,这与调试器步骤的数量大致相同.

In an answer to a recent question, I contrived a couple of clever little regexes (at the asker's request) to match a substring at either the beginning or the end of a string. When run on Regex101, however, I noted that the different patterns have different step counts (indicating that the regex engine has to do more work for one vs. the other). To my mind, however, there is no intuitive reason that this should be so.

The three patterns are as follows:

  • Fun with conditionals: /(^)?!next(?(1)|$)/ (demo - 86 steps)
  • Classic alternation: ^!next|!next$ (demo - 58 steps)
  • Nasty lookarounds: !next(?:(?<=^.{5})|(?=$)) (demo - 35 steps)

Why is the first pattern so much less efficient than the second, and, most confusingly, why is the third so efficient?

解决方案

TL;DR

Why is the first pattern so much less efficient than the second, and, most confusingly, why is the third so efficient?

Because first two are anchored, third is not.

Real story, how steps are taken

Considering this regex /^x/gm, how many steps do you think engine will take to return a "no match" if subject string is abc? You are right, two.

  1. Assert beginning of string
  2. Match x

Then overall match fails since no x immediately comes after beginning of string assertion.

Well I lied. It’s not that I’m nasty, it just makes it easier to understand things that are going to happen. According to regex101.com it takes no steps at all:

Shall you believe it this time? Yes. No. Let's see.

PCRE start-up optimizations

PCRE, being kind to its users, provides some features to speed up things that is called start-up optimization. It does some dependent optimizations in according to Regular Expressions being used.

One important feature of these optimizations is a pre-scan of subject string in order to ensure that:

  • subject string contains at least one character that corresponds to the first character of match or
  • if a known starting point exists.

If one is not found matching function never runs.

Saying that, if our regex is /x/ and our subject string is abc then with start-up optimization enabled, a pre-scan is intended to be done to look for x, if is not found whole match fails or more better it doesn't even bother to go through matching process.

So how do these information help?

Let's flashback to our first example and change our regex a little bit. From:

/^x/gm

to

/^x/g

The difference is m flag that is getting unset. For those who don't know what m flag does if is set:

It changes the meaning of ^ and $ symbols in the sense that they no more mean start and end of string but start and end of line.

Now what if we run this regex /^x/g over our subject string abc? Should we expect a difference in steps engine takes or not? Absolutely, yes. Let's look at regex101.com returned info:

I really encourage you to believe it this time. It's actual.

What's happening?

Well, it seems a little confusing but we are going to enlighten things up. When there is no m modifier set, pre-scan looks to assert start of string (a known starting point). If assertion passes then actual matching function runs otherwise "no match" will return.

But wait... every subject string definitely has one and only start of string position and it's always at the very beginning of it. So wouldn't be a pre-scan obviously unnecessary? Yes, engine doesn't do a pre-scan here. With /^x/g it immediately asserts start of string and then fails like so (since it matches at ^ it goes through actual matching process). That's why we see regex101.com shows number of steps are 2.

But... with setting m modifier things differ. Now meaning of both ^ and $ anchors are changed. With ^ matching start of line, assertion of the same position in subject string abc happens but next immediate character is not x, being within actual matching process and since g flag is on, next match starts at position before b and fails and this trial and error continues up to end of subject string.

Debugger shows 6 steps but main page says 0 steps, why?

I'm not sure about latter but for the sake of debugging, regex101 debugger runs with (*NO_START_OPT) so 6 steps are true only if this verb is set. And I said I'm not sure about latter because all anchored patterns prevent a further pre-scan optimization and we should know what can be called an anchored pattern:

A pattern is automatically anchored by PCRE if all of its top-level alternatives begin with one of the following:

  • ^ unless PCRE_MULTILINE is set
  • \A always
  • \G always
  • .* if PCRE_DOTALL is set and there are no back references to the subpattern in which .* appears

Now you got entirely what I was talking about when I was saying no pre-scan happens while m flag is not set in /^x/g: It's considered an anchored pattern which disables pre-scan optimization. So when m flag is on, this is no more an anchored pattern: /^x/gm hence pre-scan optimization could take place.

Engine knows start of string anchor \A (or ^ while multiline mode is disable) occurs only once when is matched so it doesn't continue at the next position.

Back to your own RegExes

First two are anchored (^ in conjunction with m flag), third is not. That is, third regex benefits from a pre-scan optimization. You can believe in 35 steps since an optimization caused it. But if you disable start-up optimization:

(*NO_START_OPT)!next(?:(?<=^.{5})|(?=$))

You will see 57 steps which is mostly the same as number of debugger steps.

这篇关于为什么这三个正则表达式的步数不同?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆