我们如何匹配 a^n b^n? [英] How can we match a^n b^n?

查看:42
本文介绍了我们如何匹配 a^n b^n?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

<块引用>

这是一系列教育正则表达式文章的第二部分.它展示了如何使用前瞻和嵌套引用来匹配非常规语言 anbn.嵌套引用首先在:这个正则表达式如何找到三角形数?

典型的非常规语言之一是:

<块引用>

L = { anbn: n >0 }

这是所有非空字符串的语言,由一定数量的 a 和等量的 b 组成.这种语言中的字符串示例有 abaabbaaabbb.

这种语言可以通过抽引引理显示为非常规语言.它实际上是一种原型上下文无关语言,它可以由无上下文语法 S → aSb |ab.

尽管如此,现代正则表达式实现清楚地识别出的不仅仅是常规语言.也就是说,它们不是形式语言理论定义的常规".PCRE 和 Perl 支持递归正则表达式,.NET 支持平衡组定义.甚至更少的花哨"功能,例如反向引用匹配,表示正则表达式不正则.

但是这个基本"功能到底有多强大?例如,我们可以使用 Java 正则表达式识别 L 吗?我们能否将环视和嵌套引用结合起来,并有一个适用于例如的模式String.matches 匹配abaabbaaabbb等字符串?

参考资料

相关问题

解决方案

答案是,不用说,YES!你肯定可以写一个 Java 正则表达式来匹配 anbn.它对断言使用正向前瞻,对计数"使用一个嵌套引用.

这个答案不会立即给出模式,而是会引导读者完成推导它的过程.随着解决方案的慢慢构建,给出了各种提示.在这方面,希望这个答案不仅仅包含另一个简洁的正则表达式模式.希望读者也能学会如何用正则表达式思考",如何将各种结构和谐地组合在一起,以便日后自己推导出更多的模式.

用于开发解决方案的语言将是 PHP,因为它的简洁性.模式完成后的最终测试将在 Java 中完成.


第 1 步:前瞻断言

让我们从一个更简单的问题开始:我们想在字符串的开头匹配a+,但前提是它后面紧跟b+.我们可以使用 ^锚定我们的比赛,因为我们只想匹配 a+ 而没有 b+,我们可以使用 lookahead 断言 (?=…).

这是我们使用简单测试工具的模式:

function testAll($r, $tests) {foreach ($tests as $test) {$isMatch = preg_match($r, $test, $groups);$groupsJoined = join('|', $groups);打印($test $isMatch $groupsJoined\n");}}$tests = array('aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb');$r1 = '/^a+(?=b+)/';# └────┘#         展望testAll($r1, $tests);

输出是(在 ideone.com 上看到的):

aaa 0aaab 1 aaa000阿布 1 a

这正是我们想要的输出:我们匹配 a+,仅当它位于字符串的开头,并且仅当它紧跟在 b+ 之后.>

教训:您可以在环顾中使用模式来做出断言.


第 2 步:在前瞻(和 f r e - s p a c i n g 模式)中捕获

现在假设即使我们不希望 b+ 成为匹配的一部分,我们也希望 捕获到第 1 组.此外,由于我们预计会有更复杂的模式,让我们使用 x 修饰符来表示 free-spacing 这样我们就可以使我们的正则表达式更具可读性.

在我们之前的 PHP 代码段的基础上,我们现在有以下模式:

$r2 = '/^ a+ (?= (b+) )/x';# │ └──┘ │# │ 1 │# └──────────┘#              展望testAll($r2, $tests);

输出现在是(在 ideone.com 上看到的):

aaa 0aaab 1 aaa|b000bbb 1 a|bbb

请注意,例如aaa|bjoin 的结果 - 每个组用 '|' 捕获的内容.在这种情况下,第 0 组(即匹配的模式)捕获了 aaa,第 1 组捕获了 b.

课程:您可以在环视中捕捉.您可以使用自由间距来增强可读性.


第 3 步:将前瞻重构为循环"

在介绍我们的计数机制之前,我们需要对我们的模式做一个修改.当前,前瞻在 + 重复循环"之外.到目前为止这很好,因为我们只是想断言在我们的 a+ 后面有一个 b+,但是我们最终真正想做的是断言对于我们在循环"中匹配的每个 a,都有一个相应的 b 与之配套.

我们暂时不用担心计数机制,只需进行如下重构:

  • 首先将 a+ 重构为 (?: a )+(注意 (?:...) 是一个非捕获组)
  • 然后在这个非捕获组内移动前瞻
    • 请注意,我们现在必须跳过"a* 在我们可以看到"之前b+,因此相应地修改模式

所以我们现在有以下内容:

$r3 = '/^ (?: a (?= a* (b+) ) )+/x';# │ │ └──┘ │ │# │ │ 1 │ │# │ └──────────────┘ │# │ 前瞻 │# └──────────────────────┘# 非捕获组

输出与以前相同(在 ideone.com 上看到),因此没有变化看待.重要的是,现在我们在+循环"的每次迭代上进行断言.对于我们当前的模式,这不是必需的,但接下来我们将使第 1 组计数"供我们使用自参考.

教训:您可以在非捕获组内进行捕获.环视可以重复.


第 4 步:这是我们开始计数的步骤

这是我们要做的:我们将重写第 1 组,以便:

  • +的第一次迭代结束时,当第一个a匹配时,应该捕获b
  • 在第二次迭代结束时,当另一个a匹配时,应该捕获bb
  • 在第三次迭代结束时,它应该捕获bbb
  • ...
  • n 次迭代结束时,第 1 组应捕获 bn
  • 如果没有足够的 b 来捕获到组 1,那么断言就会失败

因此,现在是 (b+) 的第 1 组必须重写为类似 (\1 b) 的内容.也就是说,我们尝试添加"a b 到第 1 组在前一次迭代中捕获的内容.

这里有一个小问题,这个模式缺少基本情况",即它可以在没有自引用的情况下进行匹配的情况.需要基本情况​​,因为组 1 开始未初始化";它尚未捕获任何内容(甚至不是空字符串),因此自引用尝试将始终失败.

有很多方法可以解决这个问题,但现在让我们将自我引用匹配可选,即 \1?.这可能会也可能不会完美地工作,但让我们看看它的作用,如果有任何问题,那么当我们到达时我们会越过那座桥.此外,我们将在此过程中添加更多测试用例.

$tests = array('aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb');$r4 = '/^ (?: a (?= a* (\1? b) ) )+/x';# │ │ └─────┘ |│# │ │ 1 |│# │ └────────────────┘ │# │ 前瞻 │# └────────────────────────┘# 非捕获组

输出现在是(在 ideone.com 上看到的):

aaa 0aaab 1 aaa|b # (*喘气!*)000abbb 1 a|b # 是的!aabb 1 aa|bb # 是的!!aaabbbbb 1 aaa|bbb # 是的!!!aaaaabbb 1 aaaaa|bb # NOOOOOoooooo....

啊哈!看起来我们现在真的很接近解决方案了!我们设法让第 1 组计数".使用自参考!但是等等......第二个也是最后一个测试用例有问题!!b 不够,不知怎么算错了!我们将在下一步中研究为什么会发生这种情况.

课程:初始化"的一种方法自引用组是使自引用匹配成为可选的.


步骤 4½:了解问题所在

问题在于,由于我们将自引用匹配设为可选,因此计数器"不可用.可以重置"当 b 不够时返回 0.让我们以 aaaaabbb 作为输入,仔细检查我们模式的每次迭代时会发生什么.

 a a a a a b b b↑# 初始状态:组 1 为未初始化"._a a a a b b b↑# 第一次迭代:第 1 组无法匹配 \1,因为它未初始化",# 所以它匹配并捕获到 b___a a a a b b b↑# 第 2 次迭代:第 1 组匹配 \1b 并捕获 bb_____a a a a b b b↑# 第 3 次迭代:第 1 组匹配 \1b 并捕获 bbb_a a a a b b b↑# 第 4 次迭代:第 1 组仍然可以匹配 \1,但不能匹配 \1b,# (!!!) 所以它匹配并捕获了 b___a a a a b b b↑# 第 5 次迭代:第 1 组匹配 \1b 并捕获 bb## 没有更多的, + 循环";终止

啊哈!在我们的第 4 次迭代中,我们仍然可以匹配 \1,但是我们无法匹配 \1b!由于我们允许使用 \1? 选择自引用匹配,因此引擎回溯并接受不,谢谢"选项,然后允许我们匹配和捕获 b!

但是请注意,除了第一次迭代之外,您始终可以只匹配自引用 \1.当然,这是显而易见的,因为这是我们在上一次迭代中刚刚捕获的内容,并且在我们的设置中我们总是可以再次匹配它(例如,如果我们上次捕获了 bbb,我们可以保证有仍然是bbb,但这次可能有也可能没有bbbb).

教训:谨防回溯.正则表达式引擎将尽可能多地进行回溯,直到给定的模式匹配为止.这可能会影响性能(即灾难性回溯)和/或正确性.


第 5 步:自我占有来拯救!

修复"现在应该很明显了:将可选的重复与 possessive 量词结合起来.也就是说,不是简单的 ?,而是使用 ?+(记住,被量化为所有格的重复不会回溯,即使这种合作"可能导致整体模式的匹配).

用非常非正式的术语来说,这就是 ?+??? 所说的:

<块引用>

?+

  • (可选)它不必在那里,"
    • (占有)但如果它在那里,你必须接受它而不是放手!"

?

  • (可选)它不必在那里,"
    • (贪婪)但如果是的话,你可以暂时接受,"
      • (回溯)但您可能会被要求稍后放手!"

??

  • (可选)它不必在那里,"
    • (不情愿)即使是这样,你也不必马上接受,"
      • (回溯)但您可能会被要求稍后接受!"

在我们的设置中,\1 不会第一次出现,但它会总是在那之后的任何时候出现,我们总是 然后想匹配它.因此, \1?+ 将完成我们想要的.

$r5 = '/^ (?: a (?= a* (\1?+ b) ) )+/x';# │ │ └────────┘ │ │# │ │ 1 │ │# │ └───────────────────┘ │# │ 前瞻 │# └───────────────────────────┘# 非捕获组

现在输出是(在 ideone.com 上看到的):

aaa 0aaab 1 a|b # 耶!固定的!0夏令营 00abb 1 a|baabb 1 aa|bbaaabbbbb 1 aaa|bbbaaaaabbb 1 aaa|bbb # 万岁!!!

瞧!!!问题解决了!!!我们现在正按照我们想要的方式正确计数!

教训:了解贪婪、不情愿和占有欲重复之间的区别.Optional-possessive 可以是一个强大的组合.


第 6 步:收尾

所以我们现在拥有的是一个重复匹配a的模式,并且对于每个匹配的a,都有一个对应的b> 在第 1 组中捕获. + 在没有更多 a 时终止,或者如果断言因没有相应的 b 而失败对于 a.

为了完成这项工作,我们只需要附加到我们的模式 \1 $.现在这是对第 1 组匹配内容的反向引用,后跟行锚点的结尾.锚点确保字符串中没有任何额外的 b ;换句话说,实际上我们有anbn.

这是最终的模式,带有额外的测试用例,包括一个长度为 10,000 个字符的测试用例:

$tests = array('aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb','', 'ab', 'abb', 'aab', 'aaaabb', 'aaabbb', 'bbbaaa', 'ababab', 'abc',str_repeat('a', 5000).str_repeat('b', 5000));$r6 = '/^ (?: a (?= a* (\1?+ b) ) )+ \1 $/x';# │ │ └────────┘ │ │# │ │ 1 │ │# │ └───────────────────┘ │# │ 前瞻 │# └───────────────────────────┘# 非捕获组

它找到 4 个匹配项:abaabbaaabbba5000b5000.在 ideone.com 上运行仅需 0.06 秒.


第 7 步:Java 测试

所以该模式在 PHP 中有效,但最终目标是编写一个在 Java 中有效的模式.

public static void main(String[] args) {String aNbN = "(?x) (?: a (?= a* (\\1?+ b)) )+ \\1";字符串 [] 测试 = {"",//假"ab",//真"abb",//假"aab",//假"aabb",//真abab",//假"abc",//假repeat('a', 5000) + repeat('b', 4999),//假repeat('a', 5000) + repeat('b', 5000),//真repeat('a', 5000) + repeat('b', 5001),//假};对于(字符串测试:测试){System.out.printf("[%s]%n %s%n%n", test, test.matches(aNbN));}}静态字符串重复(char ch,int n){return new String(new char[n]).replace('\0', ch);}

该模式按预期工作(在 ideone.com 上看到).


现在我们得出结论......

需要说明的是,前瞻中的 a* 和主 + 循环"都允许回溯.鼓励读者确认为什么这在正确性方面不是问题,以及为什么同时使两个所有格都有效(尽管在同一模式中混合强制性和非强制性所有格可能会导致误解).

还应该说,虽然有一个匹配 anbn 的正则表达式模式很整洁,但这不是永远是最好的"在实践中解决.更好的解决方案是简单地匹配 ^(a+)(b+)$,然后比较托管编程语言中第 1 组和第 2 组捕获的字符串的长度.

在 PHP 中,它可能看起来像这样(在 ideone.com 中看到):

function is_anbn($s) {return (preg_match('/^(a+)(b+)$/', $s, $groups)) &&(strlen($groups[1]) == strlen($groups[2]));}

本文的目的不是让读者相信正则表达式几乎可以做任何事情;它显然不能,即使对于它可以做的事情,如果它导致更简单的解决方案,至少应该考虑将部分委派给托管语言.

正如顶部所提到的,虽然这篇文章必须为 stackoverflow 标记为 [regex],但它可能不止于此.虽然学习断言、嵌套引用、所有格量词等当然很有价值,但也许这里更大的教训是一个人可以尝试解决问题的创造性过程,当你遇到问题时通常需要决心和努力工作各种约束条件、各个部分的系统组合以构建工作解决方案等.


奖励材料!PCRE 递归模式!

既然我们确实提出了PHP,那么需要说明的是PCRE支持递归模式和子程序.因此,以下模式适用于 preg_match(在 ideone.com 上看到):

$rRecursive = '/^ (a (?1)? b) $/x';

目前 Java 的正则表达式不支持递归模式.


更多的奖励材料!匹配anbncn !!

所以我们已经看到了如何匹配 anbn 这是非常规的,但仍然与上下文无关,但可以我们也匹配anbncn,这甚至不是上下文无关的?

答案当然是YES!鼓励读者尝试自己解决这个问题,但下面提供了解决方案(在 ideone.com 上用 Java 实现).

^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $

This is the second part of a series of educational regex articles. It shows how lookaheads and nested references can be used to match the non-regular languge anbn. Nested references are first introduced in: How does this regex find triangular numbers?

One of the archetypal non-regular languages is:

L = { anbn: n > 0 }

This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's. Examples of strings in this language are ab, aabb, aaabbb.

This language can be show to be non-regular by the pumping lemma. It is in fact an archetypal context-free language, which can be generated by the context-free grammar S → aSb | ab.

Nonetheless, modern day regex implementations clearly recognize more than just regular languages. That is, they are not "regular" by formal language theory definition. PCRE and Perl supports recursive regex, and .NET supports balancing groups definition. Even less "fancy" features, e.g. backreference matching, means that regex is not regular.

But just how powerful is this "basic" features? Can we recognize L with Java regex, for example? Can we perhaps combine lookarounds and nested references and have a pattern that works with e.g. String.matches to match strings like ab, aabb, aaabbb, etc?

References

Linked questions

解决方案

The answer is, needless to say, YES! You can most certainly write a Java regex pattern to match anbn. It uses a positive lookahead for assertion, and one nested reference for "counting".

Rather than immediately giving out the pattern, this answer will guide readers through the process of deriving it. Various hints are given as the solution is slowly constructed. In this aspect, hopefully this answer will contain much more than just another neat regex pattern. Hopefully readers will also learn how to "think in regex", and how to put various constructs harmoniously together, so they can derive more patterns on their own in the future.

The language used to develop the solution will be PHP for its conciseness. The final test once the pattern is finalized will be done in Java.


Step 1: Lookahead for assertion

Let's start with a simpler problem: we want to match a+ at the beginning of a string, but only if it's followed immediately by b+. We can use ^ to anchor our match, and since we only want to match the a+ without the b+, we can use lookahead assertion (?=…).

Here is our pattern with a simple test harness:

function testAll($r, $tests) {
   foreach ($tests as $test) {
      $isMatch = preg_match($r, $test, $groups);
      $groupsJoined = join('|', $groups);
      print("$test $isMatch $groupsJoined\n");
   }
}
 
$tests = array('aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb');
 
$r1 = '/^a+(?=b+)/';
#          └────┘
#         lookahead

testAll($r1, $tests);

The output is (as seen on ideone.com):

aaa 0
aaab 1 aaa
aaaxb 0
xaaab 0
b 0
abbb 1 a

This is exactly the output we want: we match a+, only if it's at the beginning of the string, and only if it's immediately followed by b+.

Lesson: You can use patterns in lookarounds to make assertions.


Step 2: Capturing in a lookahead (and f r e e - s p a c i n g mode)

Now let's say that even though we don't want the b+ to be part of the match, we do want to capture it anyway into group 1. Also, as we anticipate having a more complicated pattern, let's use x modifier for free-spacing so we can make our regex more readable.

Building on our previous PHP snippet, we now have the following pattern:

$r2 = '/ ^ a+ (?= (b+) ) /x';
#             │   └──┘ │
#             │     1  │
#             └────────┘
#              lookahead
 
testAll($r2, $tests);

The output is now (as seen on ideone.com):

aaa 0
aaab 1 aaa|b
aaaxb 0
xaaab 0
b 0
abbb 1 a|bbb

Note that e.g. aaa|b is the result of join-ing what each group captured with '|'. In this case, group 0 (i.e. what the pattern matched) captured aaa, and group 1 captured b.

Lesson: You can capture inside a lookaround. You can use free-spacing to enhance readability.


Step 3: Refactoring the lookahead into the "loop"

Before we can introduce our counting mechanism, we need to do one modification to our pattern. Currently, the lookahead is outside of the + repetition "loop". This is fine so far because we just wanted to assert that there's a b+ following our a+, but what we really want to do eventually is assert that for each a that we match inside the "loop", there's a corresponding b to go with it.

Let's not worry about the counting mechanism for now and just do the refactoring as follows:

  • First refactor a+ to (?: a )+ (note that (?:…) is a non-capturing group)
  • Then move the lookahead inside this non-capturing group
    • Note that we must now "skip" a* before we can "see" the b+, so modify the pattern accordingly

So we now have the following:

$r3 = '/ ^ (?: a (?= a* (b+) ) )+ /x';
#          │     │      └──┘ │ │
#          │     │        1  │ │
#          │     └───────────┘ │
#          │       lookahead   │
#          └───────────────────┘
#           non-capturing group

The output is the same as before (as seen on ideone.com), so there's no change in that regard. The important thing is that now we are making the assertion at every iteration of the + "loop". With our current pattern, this is not necessary, but next we'll make group 1 "count" for us using self-reference.

Lesson: You can capture inside a non-capturing group. Lookarounds can be repeated.


Step 4: This is the step where we start counting

Here's what we're going to do: we'll rewrite group 1 such that:

  • At the end of the first iteration of the +, when the first a is matched, it should capture b
  • At the end of the second iteration, when another a is matched, it should capture bb
  • At the end of the third iteration, it should capture bbb
  • ...
  • At the end of the n-th iteration, group 1 should capture bn
  • If there aren't enough b to capture into group 1 then the assertion simply fails

So group 1, which is now (b+), will have to be rewritten to something like (\1 b). That is, we try to "add" a b to what group 1 captured in the previous iteration.

There's a slight problem here in that this pattern is missing the "base case", i.e. the case where it can match without the self-reference. A base case is required because group 1 starts "uninitialized"; it hasn't captured anything yet (not even an empty string), so a self-reference attempt will always fail.

There are many ways around this, but for now let's just make the self-reference matching optional, i.e. \1?. This may or may not work perfectly, but let's just see what that does, and if there's any problem then we'll cross that bridge when we come to it. Also, we'll add some more test cases while we're at it.

$tests = array(
  'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb'
);
 
$r4 = '/ ^ (?: a (?= a* (\1? b) ) )+ /x';
#          │     │      └─────┘ | │
#          │     │         1    | │
#          │     └──────────────┘ │
#          │         lookahead    │
#          └──────────────────────┘
#             non-capturing group

The output is now (as seen on ideone.com):

aaa 0
aaab 1 aaa|b        # (*gasp!*)
aaaxb 0
xaaab 0
b 0
abbb 1 a|b          # yes!
aabb 1 aa|bb        # YES!!
aaabbbbb 1 aaa|bbb  # YESS!!!
aaaaabbb 1 aaaaa|bb # NOOOOOoooooo....

A-ha! It looks like we're really close to the solution now! We managed to get group 1 to "count" using self-reference! But wait... something is wrong with the second and the last test cases!! There aren't enough bs, and somehow it counted wrong! We'll examine why this happened in the next step.

Lesson: One way to "initialize" a self-referencing group is to make the self-reference matching optional.


Step 4½: Understanding what went wrong

The problem is that since we made the self-reference matching optional, the "counter" can "reset" back to 0 when there aren't enough b's. Let's closely examine what happens at every iteration of our pattern with aaaaabbb as input.

 a a a a a b b b
↑
# Initial state: Group 1 is "uninitialized".
           _
 a a a a a b b b
  ↑
  # 1st iteration: Group 1 couldn't match \1 since it was "uninitialized",
  #                  so it matched and captured just b
           ___
 a a a a a b b b
    ↑
    # 2nd iteration: Group 1 matched \1b and captured bb
           _____
 a a a a a b b b
      ↑
      # 3rd iteration: Group 1 matched \1b and captured bbb
           _
 a a a a a b b b
        ↑
        # 4th iteration: Group 1 could still match \1, but not \1b,
        #  (!!!)           so it matched and captured just b
           ___
 a a a a a b b b
          ↑
          # 5th iteration: Group 1 matched \1b and captured bb
          #
          # No more a, + "loop" terminates

A-ha! On our 4th iteration, we could still match \1, but we couldn't match \1b! Since we allow the self-reference matching to be optional with \1?, the engine backtracks and took the "no thanks" option, which then allows us to match and capture just b!

Do note, however, that except on the very first iteration, you could always match just the self-reference \1. This is obvious, of course, since it's what we just captured on our previous iteration, and in our setup we can always match it again (e.g. if we captured bbb last time, we're guaranteed that there will still be bbb, but there may or may not be bbbb this time).

Lesson: Beware of backtracking. The regex engine will do as much backtracking as you allow until the given pattern matches. This may impact performance (i.e. catastrophic backtracking) and/or correctness.


Step 5: Self-possession to the rescue!

The "fix" should now be obvious: combine optional repetition with possessive quantifier. That is, instead of simply ?, use ?+ instead (remember that a repetition that is quantified as possessive does not backtrack, even if such "cooperation" may result in a match of the overall pattern).

In very informal terms, this is what ?+, ? and ?? says:

?+

  • (optional) "It doesn't have to be there,"
    • (possessive) "but if it is there, you must take it and not let go!"

?

  • (optional) "It doesn't have to be there,"
    • (greedy) "but if it is you can take it for now,"
      • (backtracking) "but you may be asked to let it go later!"

??

  • (optional) "It doesn't have to be there,"
    • (reluctant) "and even if it is you don't have to take it just yet,"
      • (backtracking) "but you may be asked to take it later!"

In our setup, \1 will not be there the very first time, but it will always be there any time after that, and we always want to match it then. Thus, \1?+ would accomplish exactly what we want.

$r5 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ /x';
#          │     │      └──────┘ │ │
#          │     │          1    │ │
#          │     └───────────────┘ │
#          │         lookahead     │
#          └───────────────────────┘
#             non-capturing group

Now the output is (as seen on ideone.com):

aaa 0
aaab 1 a|b          # Yay! Fixed!
aaaxb 0
xaaab 0
b 0
abbb 1 a|b
aabb 1 aa|bb
aaabbbbb 1 aaa|bbb
aaaaabbb 1 aaa|bbb  # Hurrahh!!!

Voilà!!! Problem solved!!! We are now counting properly, exactly the way we want it to!

Lesson: Learn the difference between greedy, reluctant, and possessive repetition. Optional-possessive can be a powerful combination.


Step 6: Finishing touches

So what we have right now is a pattern that matches a repeatedly, and for every a that was matched, there is a corresponding b captured in group 1. The + terminates when there are no more a, or if the assertion failed because there isn't a corresponding b for an a.

To finish the job, we simply need to append to our pattern \1 $. This is now a back reference to what group 1 matched, followed by the end of the line anchor. The anchor ensures that there aren't any extra b's in the string; in other words, that in fact we have anbn.

Here's the finalized pattern, with additional test cases, including one that's 10,000 characters long:

$tests = array(
  'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb',
  '', 'ab', 'abb', 'aab', 'aaaabb', 'aaabbb', 'bbbaaa', 'ababab', 'abc',
  str_repeat('a', 5000).str_repeat('b', 5000)
);
 
$r6 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ \1 $ /x';
#          │     │      └──────┘ │ │
#          │     │          1    │ │
#          │     └───────────────┘ │
#          │         lookahead     │
#          └───────────────────────┘
#             non-capturing group

It finds 4 matches: ab, aabb, aaabbb, and the a5000b5000. It takes only 0.06s to run on ideone.com.


Step 7: The Java test

So the pattern works in PHP, but the ultimate goal is to write a pattern that works in Java.

public static void main(String[] args) {
 
        String aNbN = "(?x) (?:  a  (?= a* (\\1?+ b))  )+ \\1";
        String[] tests = {
                "",      // false
                "ab",    // true
                "abb",   // false
                "aab",   // false
                "aabb",  // true
                "abab",  // false
                "abc",   // false
                repeat('a', 5000) + repeat('b', 4999), // false
                repeat('a', 5000) + repeat('b', 5000), // true
                repeat('a', 5000) + repeat('b', 5001), // false
        };
        for (String test : tests) {
                System.out.printf("[%s]%n  %s%n%n", test, test.matches(aNbN));
        }
 
}
 
static String repeat(char ch, int n) {
        return new String(new char[n]).replace('\0', ch);
}

The pattern works as expected (as seen on ideone.com).


And now we come to the conclusion...

It needs to be said that the a* in the lookahead, and indeed the "main + loop", both permit backtracking. Readers are encouraged to confirm why this is not a problem in terms of correctness, and why at the same time making both possessive would also work (though perhaps mixing mandatory and non-mandatory possessive quantifier in the same pattern may lead to misperceptions).

It should also be said that while it's neat that there's a regex pattern that will match anbn, this is in not always the "best" solution in practice. A much better solution is to simply match ^(a+)(b+)$, and then compare the length of the strings captured by groups 1 and 2 in the hosting programming language.

In PHP, it may look something like this (as seen in ideone.com):

function is_anbn($s) {
   return (preg_match('/^(a+)(b+)$/', $s, $groups)) &&
      (strlen($groups[1]) == strlen($groups[2]));
}

The purpose of this article is NOT to convince readers that regex can do almost anything; it clearly can't, and even for the things it can do, at least partial delegation to the hosting language should be considered if it leads to a simpler solution.

As mentioned at the top, while this article is necessarily tagged [regex] for stackoverflow, it is perhaps about more than that. While certainly there's value in learning about assertions, nested references, possessive quantifier, etc, perhaps the bigger lesson here is the creative process by which one can try to solve problems, the determination and hard work that it often requires when you're subjected to various constraints, the systematic composition from various parts to build a working solution, etc.


Bonus material! PCRE recursive pattern!

Since we did bring up PHP, it needs to be said that PCRE supports recursive pattern and subroutines. Thus, following pattern works for preg_match (as seen on ideone.com):

$rRecursive = '/ ^ (a (?1)? b) $ /x';

Currently Java's regex does not support recursive pattern.


Even more bonus material! Matching anbncn !!

So we've seen how to match anbn which is non-regular, but still context-free, but can we also match anbncn, which isn't even context-free?

The answer is, of course, YES! Readers are encouraged to try to solve this on their own, but the solution is provided below (with implementation in Java on ideone.com).

^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $

这篇关于我们如何匹配 a^n b^n?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆