我们怎样才能将^ n b ^ n与Java正则表达式匹配? [英] How can we match a^n b^n with Java regex?

查看:178
本文介绍了我们怎样才能将^ n b ^ n与Java正则表达式匹配?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述


这是一系列教育正则表达式文章的第二部分。它显示了前瞻和嵌套引用如何用于匹配非常规语言 n b n 。嵌套引用首先介绍于:此正则表达式如何找到三角形数字?


其中一个原型非常规语言是:


L = {a n b n :n> ; 0}


这是所有非空字符串的语言,由一些<$ c组成$ c> a 之后是相同数量的 b 。此语言的字符串示例为 ab aabb aaabbb



抽取引理。它实际上是一个典型的无上下文语言,可以由无上下文语法 S→aSb | ab



尽管如此,现代正则表达式实现清楚地认识到的不仅仅是常规语言。也就是说,它们不是形式语言理论定义的规则。 PCRE和Perl支持递归正则表达式,.NET支持平衡组定义。甚至更少奇特的功能,例如反向引用匹配,意味着正则表达式不规则。



但这个基本功能有多强大?例如,我们可以用Java正则表达式识别 L 吗?我们是否可以组合使用lookarounds和嵌套引用,并且可以使用与 String.matches 匹配 ab aabb 等字符串, aaabbb 等?



参考文献





已关联的问题




解决方案

答案是,不用说,是!你当然可以编写一个Java正则表达式模式来匹配名词 b'SUP>名词 的。它使用正向前导进行断言,并使用一个嵌套引用进行计数。



这个答案不会立即给出模式,而是引导读者通过过程派生它。随着解决方案的缓慢构建,给出了各种提示。在这方面,希望这个答案不仅包含另一个整洁的正则表达式模式。希望读者也能学会如何在正则表达式中思考,以及如何将各种结构和谐地结合在一起,这样他们将来可以自己获得更多的模式。



用于开发解决方案的语言将是PHP的简洁性。模式完成后的最终测试将在Java中完成。






步骤1:断言前瞻



让我们从一个更简单的问题开始:我们希望在字符串的开头匹配 a + ,但前提是它紧跟着 b + 。我们可以使用 ^ 锚点我们的比赛,因为我们只想匹配 a + 而没有 b + ,我们可以使用 lookahead 断言(?= ...)



这是我们的模式,带有一个简单的测试工具:

  function testAll($ r,$ tests){
foreach($ tests as $ test){
$ isMatch = preg_match($ r,$ test,$ groups);
$ groupsJoined = join('|',$ groups);
print($ test $ isMatch $ groupsJoined\\\
);
}
}

$ tests = array('aaa','aaab','aaaxb','xaaab','b','abbb');

$ r1 ='/ ^ a +(?= b +)/';
#└────
#lookahead

testAll($ r1,$ tests);

输出为(见于ideone.com ):

  aaa 0 
aaab 1 aaa
aaaxb 0
xaaab 0
b 0
abbb 1 a

这正是我们想要的输出:我们匹配 a + ,只有它位于字符串的开头,并且只有当它紧跟着 b +



课程 :您可以使用外观模式进行断言。






第2步:捕捉前瞻(和自由间隔模式)



现在让我们说即使我们不希望 b + 成为比赛的一部分,我们也希望将其捕获到组1中。另外,由于我们预计会有更复杂的模式,让我们使用 x 自由间距因此我们可以使我们的正则表达式更具可读性。



建立在我们的基础之上以前的PHP代码段,我们现在有以下模式:

  $ r2 ='/ ^ a +(?=(b +))/ X'; 
#│└──┘│
#│1│
#└────────
#lookahead

testAll( $ r2,$ tests);

现在输出(见于ideone.com ):

  aaa 0 
aaab 1 aaa | b
aaaxb 0
xaaab 0
b 0
abbb 1 a | bbb

请注意,例如 aaa | b 加入的结果 - 用'|'捕获每个组的结果。在这种情况下,组0(即匹配的模式)捕获 aaa ,组1捕获 b



课程 :您可以在一个环视中捕获。您可以使用自由间距来增强可读性。






步骤3:将前瞻重构为循环



在我们介绍计数机制之前,我们需要对模式进行一次修改。目前,前瞻是在 + 重复循环之外。到目前为止这很好,因为我们只是想断言我们的 a + 后面有一个 b + ,但我们真的最终要做的是断言,对于我们在循环中匹配的每个 a ,有一个相应的 b 配合它。



现在让我们不要担心计数机制,只需按照以下步骤进行重构:




  • 第一个重构 a + (?:a)+ (请注意(?: ...)是非捕获组)

  • 然后在非捕获组内移动前瞻


    • 请注意,我们现在必须跳过 a * 才能看到 b + ,因此相应地修改模式




所以我们现在有以下内容:

  $ r3 ='/ ^(?:a(?= a *(b +) )))+ / x'; 
#││└──┘││
#││1││
#│└──────────────────── │
#└─────────────
#非捕获组

输出与以前相同(在ideone.com上看到) ),所以在这方面没有变化。重要的是,现在我们在每次迭代 + 循环中进行断言。根据我们目前的模式,这不是必要的,但接下来我们将使用自我引用使第1组计数。



课程 :您可以在非捕获组内捕获。可以重复看看。






步骤4:这是我们开始计算的步骤



以下是我们要做的事情:我们将重写第1组:




  • 最后 + 的第一次迭代,当第一个 a 匹配时,它应该捕获 b

  • 在第二次迭代结束时,当另一个 a 匹配时,它应该捕获 bb

  • 在第三次迭代结束时,它应该捕获 bbb

  • ...

  • n -th迭代结束时,第1组应捕获 b < sup> n

  • 如果没有足够的 b 来捕获到组1那么断言只是失败



所以第1组,现在是(b +),将必须重写为(\ 1b)。也就是说,我们尝试将 b 添加到上一次迭代中捕获的组1。



这是这里的一个小问题是,这种模式缺少基本情况,即它可以在没有自引用的情况下匹配的情况。需要一个基本案例,因为组1开始未初始化;它还没有捕获任何东西(甚至没有空字符串),所以自引用尝试总是会失败。



有很多方法可以解决这个问题,但是现在让我们进行自我引用匹配可选,即 \ 1 。这可能会或可能不会完美,但让我们看看它是做什么的,如果有任何问题,那么当我们来到它时,我们将跨越那座桥梁。此外,我们还会添加一些测试用例。

  $ tests = array(
'aaa','aaab','aaaxb','xaaab','b','abbb','aabb','aaabbbbb','aaaaabbb'
);

$ r4 ='/ ^(?:a(?= a *(\ 1?b)))+ / x';
#││└──────| │
#││1| │
#│└───────────────────────────────────────────────────────────────────────────────────────────────────────── ──────────
#非捕获组

输出现在(在ideone.com上看到):

  aaa 0 
aaab 1 aaa | b#(* gasp!*)
aaaxb 0
xaaab 0
b 0
abbb 1 a | b#yes!
aabb 1 aa | bb#YES !!
aaabbbbb 1 aaa | bbb #YESS !!!
aaaaabbb 1 aaaaa | bb#NOOOOOoooooo ....

A-ha!看起来我们现在非常接近解决方案!我们设法使用自我引用让第1组计数!但是等等......第二次和最后一次测试用例出了问题!没有足够的 b s,不知何故错了!我们将在下一步中检查这种情况发生的原因。



课程 :初始化a的一种方法自引用组是使自引用匹配可选。






步骤4½:了解出了什么问题



问题在于,由于我们将自引用匹配设置为可选,因此当没有足够的 b时,计数器可以重置回0 的。让我们仔细研究一下我们模式的每次迭代会发生什么,并输入 aaaaabbb

  aaaaabbb 

#初始状态:第1组是未初始化。
_
aaaaabbb

#1st iteration:第1组无法与\1匹配,因为它是未初始化,
#所以匹配并捕获只是b
___
aaaaabbb

#2nd迭代:第1组匹配\ 1b并且捕获bb
_____
aaaaabbb

#第3次迭代:第1组匹配\ 1b并且捕获bbb
_
aaaaabbb

#第4次迭代:第1组仍然可以匹配\ 1,但不能匹配\ 1b,
#(!!!)所以它匹配并捕获了b
___
aaaaabbb

#5th迭代:第1组匹配\ 1b并且捕获了bb

#不再a,+循环终止

A-ha!在我们的第四次迭代中,我们仍然可以匹配 \1 ,但我们无法匹配 \ lb !由于我们允许自引用匹配对于 \1?是可选的,引擎回溯并采取不用谢谢选项,然后允许我们匹配和捕获只需 b



请注意,除非在第一次迭代时,您始终可以匹配自我引用 \1 。当然,这是显而易见的,因为它是我们刚刚在前一次迭代中捕获的内容,在我们的设置中我们可以再次匹配它(例如,如果我们上次捕获 bbb ,我们保证仍会有 bbb ,但这次可能有 bbbb



课程 :谨防回溯。正则表达式引擎将执行尽可能多的回溯,直到给定的模式匹配为止。这可能会影响性能(即灾难性回溯)和/或正确性。






第5步:自救到救援!



修复现在应该是显而易见的:将可选重复与占有式量词结合起来。也就是说,而不是简单地,而是使用?+ (请记住,量化为占有的重复不会回溯,即使这样的合作可能导致整体模式的匹配)。



非常非正式地说,这就是?+ ?? 说:


?+




  • (可选)它不必在那里,


    • (占有欲)但如果它在那里,你必须接受它而不是放手!







  • (可选)它不必存在,


    • (贪婪)但如果它现在可以拿走,


      • (回溯)但你可能会被要求让它去吧!





??




  • (可选)它不必在那里,


    • (不情愿)即使它你是不是还没拿它,


      • (回溯)但你可能会被要求以后再拿它!




在我们的设置中, \1 将不会在第一次出现,但在此之后总是在那里,我们总是想要匹配它。因此, \1?+ 将完全符合我们的要求。

  $ r5 ='/ ^(?:a(?= a *(\\\?+ + b)))+ / x'; 
#││└─────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
#│lookahead│
#└───────────────
#非捕获组

现在输出为(见于ideone.com ):

  aaa 0 
aaab 1 a | b # 好极了!固定!
aaaxb 0
xaaab 0
b 0
abbb 1 a | b
aabb 1 aa | bb
aaabbbbb 1 aaa | bbb
aaaaabbb 1 aaa | bbb #Hurrahh !!!

Voilà!!!问题解决了!!!我们正在按照我们想要的方式正确计算!



课程 :了解贪婪之间的区别,不情愿和占有欲的重复。可选占有可以是一个强大的组合。






第6步:完成接触



所以我们现在所拥有的模式是重复匹配 a 的模式,以及每个 a 的模式匹配后,组1中捕获了相应的 b + 在没有<$ c时终止$ c> a ,或者如果断言失败,因为 a b c $ c>。



要完成这项工作,我们只需要附加到我们的模式 \1 $ 。现在这是对第1组匹配的后向引用,后面是行锚的结束。锚点确保字符串中没有任何额外的 b ;换句话说,事实上我们 a n b n



这是最终的模式,包含额外的测试用例,包括10,000个字符长的测试用例:

  $ tests = array(
'aaa','aaab','aaaxb','xaaab','b','abbb','aabb','aaabbbbb','aaaaabbb',
'','ab','abb' ,'aab','aaaabb','aababbb','bbbaaa','ababab','abc',
str_repeat('a',5000).str_repeat('b',5000)
);

$ r6 ='/ ^(?:a(?= a *(\ 1?+ b)))+ \ $ $ / x';
#││└─────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
#│lookahead│
#└───────────────
#非捕获组

它找到4个匹配项: ab aabb aaabbb a 5000 b 5000 只需0.06秒即可在ideone.com上运行



< hr>

第7步:Java测试



因此该模式适用于PHP,但最终目标是编写模式在Java中工作。

  public static void main(String [] args){

String aNbN =(?x)(?:a(?= a *(\\1?+ b)))+ \\\\;
String [] tests = {
,// false
ab,// true
abb,// false
aab, // false
aabb,// true
abab,// false
abc,// false
repeat('a',5000)+重复('b',4999),// false
repeat('a',5000)+ repeat('b',5000),// true
repeat('a',5000)+ repeat ('b',5001),// false
};
for(String test:tests){
System.out.printf([%s]%n%s%n%n,test,test.matches(aNbN));
}

}

static String repeat(char ch,int n){
返回new String(new char [n])。replace( '\0',ch);
}

模式按预期工作(见于ideone.com )。






和现在我们得出结论...



需要说明前瞻中的 a * ,实际上主 + 循环都允许回溯。鼓励读者确认为什么这在正确性方面不是问题,为什么同时使两者都具有占有性也会起作用(尽管在相同的模式中混合强制性和非强制性占有量词可能会导致误解)。 / p>

还应该说,虽然它很整齐,但是有一个正则表达式模式将匹配 a n b n ,这在实践中并不总是最佳解决方案。更好的解决方案是简单地匹配 ^(a +)(b +)$ ,然后比较托管编程语言中第1组和第2组捕获的字符串的长度。



在PHP中,它可能看起来像这样(在ideone中看到.com ):

  function is_anbn($ s){
return(preg_match('/ ^) (a +)(b +)$ /',$ s,$ groups))&&
(strlen($ groups [1])== strlen($ groups [2]));
}

本文的目的是 NOT 来说服读者,正则表达式几乎可以做任何事情;它显然不能,甚至对于它可以做的事情,如果它导致一个更简单的解决方案,至少应该考虑部分授权给托管语言。



As在顶部提到,虽然这篇文章必须标记为 [regex] for stackoverflow,但它可能不止于此。虽然在学习断言,嵌套参考,占有量词等方面肯定有价值,但也许更重要的教训是人们可以尝试解决问题的创造性过程,当你受到影响时经常需要的决心和努力工作。各种约束,各个部分的系统构成,以构建工作解决方案等。






奖励材料! PCRE递归模式!



由于我们确实提出了PHP,因此需要说PCRE支持递归模式和子程序。因此,以下模式适用于 preg_match 在ideone.com上看到):

  $ rRecursive ='/ ^(a(?1)?b)$ / x'; 

目前Java的正则表达式不支持递归模式。






更多奖励材料!匹配 a n b n c n !!



<那么我们已经看到如何匹配 a n b n 这是非常规的,但仍然没有上下文,但我们能不能还匹配 a n b n c n ,甚至没有上下文?



答案当然是 YES!鼓励读者自己尝试解决这个问题,但下面提供了解决方案(在ideone.com上用Java实现)。


^(?:a(?= a *(\1?+ b)b *(\2?+ c)))+ \\\ \ 2 $



This is the second part of a series of educational regex articles. It shows how lookaheads and nested references can be used to match the non-regular languge anbn. Nested references are first introduced in: How does this regex find triangular numbers?

One of the archetypal non-regular languages is:

L = { anbn: n > 0 }

This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's. Examples of strings in this language are ab, aabb, aaabbb.

This language can be show to be non-regular by the pumping lemma. It is in fact an archetypal context-free language, which can be generated by the context-free grammar S → aSb | ab.

Nonetheless, modern day regex implementations clearly recognize more than just regular languages. That is, they are not "regular" by formal language theory definition. PCRE and Perl supports recursive regex, and .NET supports balancing groups definition. Even less "fancy" features, e.g. backreference matching, means that regex is not regular.

But just how powerful is this "basic" features? Can we recognize L with Java regex, for example? Can we perhaps combine lookarounds and nested references and have a pattern that works with e.g. String.matches to match strings like ab, aabb, aaabbb, etc?

References

Linked questions

解决方案

The answer is, needless to say, YES! You can most certainly write a Java regex pattern to match anbn. It uses a positive lookahead for assertion, and one nested reference for "counting".

Rather than immediately giving out the pattern, this answer will guide readers through the process of deriving it. Various hints are given as the solution is slowly constructed. In this aspect, hopefully this answer will contain much more than just another neat regex pattern. Hopefully readers will also learn how to "think in regex", and how to put various constructs harmoniously together, so they can derive more patterns on their own in the future.

The language used to develop the solution will be PHP for its conciseness. The final test once the pattern is finalized will be done in Java.


Step 1: Lookahead for assertion

Let's start with a simpler problem: we want to match a+ at the beginning of a string, but only if it's followed immediately by b+. We can use ^ to anchor our match, and since we only want to match the a+ without the b+, we can use lookahead assertion (?=…).

Here is our pattern with a simple test harness:

function testAll($r, $tests) {
   foreach ($tests as $test) {
      $isMatch = preg_match($r, $test, $groups);
      $groupsJoined = join('|', $groups);
      print("$test $isMatch $groupsJoined\n");
   }
}

$tests = array('aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb');

$r1 = '/^a+(?=b+)/';
#          └────┘
#         lookahead

testAll($r1, $tests);

The output is (as seen on ideone.com):

aaa 0
aaab 1 aaa
aaaxb 0
xaaab 0
b 0
abbb 1 a

This is exactly the output we want: we match a+, only if it's at the beginning of the string, and only if it's immediately followed by b+.

Lesson: You can use patterns in lookarounds to make assertions.


Step 2: Capturing in a lookahead (and f r e e - s p a c i n g mode)

Now let's say that even though we don't want the b+ to be part of the match, we do want to capture it anyway into group 1. Also, as we anticipate having a more complicated pattern, let's use x modifier for free-spacing so we can make our regex more readable.

Building on our previous PHP snippet, we now have the following pattern:

$r2 = '/ ^ a+ (?= (b+) ) /x';
#             │   └──┘ │
#             │     1  │
#             └────────┘
#              lookahead

testAll($r2, $tests);

The output is now (as seen on ideone.com):

aaa 0
aaab 1 aaa|b
aaaxb 0
xaaab 0
b 0
abbb 1 a|bbb

Note that e.g. aaa|b is the result of join-ing what each group captured with '|'. In this case, group 0 (i.e. what the pattern matched) captured aaa, and group 1 captured b.

Lesson: You can capture inside a lookaround. You can use free-spacing to enhance readability.


Step 3: Refactoring the lookahead into the "loop"

Before we can introduce our counting mechanism, we need to do one modification to our pattern. Currently, the lookahead is outside of the + repetition "loop". This is fine so far because we just wanted to assert that there's a b+ following our a+, but what we really want to do eventually is assert that for each a that we match inside the "loop", there's a corresponding b to go with it.

Let's not worry about the counting mechanism for now and just do the refactoring as follows:

  • First refactor a+ to (?: a )+ (note that (?:…) is a non-capturing group)
  • Then move the lookahead inside this non-capturing group
    • Note that we must now "skip" a* before we can "see" the b+, so modify the pattern accordingly

So we now have the following:

$r3 = '/ ^ (?: a (?= a* (b+) ) )+ /x';
#          │     │      └──┘ │ │
#          │     │        1  │ │
#          │     └───────────┘ │
#          │       lookahead   │
#          └───────────────────┘
#           non-capturing group

The output is the same as before (as seen on ideone.com), so there's no change in that regard. The important thing is that now we are making the assertion at every iteration of the + "loop". With our current pattern, this is not necessary, but next we'll make group 1 "count" for us using self-reference.

Lesson: You can capture inside a non-capturing group. Lookarounds can be repeated.


Step 4: This is the step where we start counting

Here's what we're going to do: we'll rewrite group 1 such that:

  • At the end of the first iteration of the +, when the first a is matched, it should capture b
  • At the end of the second iteration, when another a is matched, it should capture bb
  • At the end of the third iteration, it should capture bbb
  • ...
  • At the end of the n-th iteration, group 1 should capture bn
  • If there aren't enough b to capture into group 1 then the assertion simply fails

So group 1, which is now (b+), will have to be rewritten to something like (\1 b). That is, we try to "add" a b to what group 1 captured in the previous iteration.

There's a slight problem here in that this pattern is missing the "base case", i.e. the case where it can match without the self-reference. A base case is required because group 1 starts "uninitialized"; it hasn't captured anything yet (not even an empty string), so a self-reference attempt will always fail.

There are many ways around this, but for now let's just make the self-reference matching optional, i.e. \1?. This may or may not work perfectly, but let's just see what that does, and if there's any problem then we'll cross that bridge when we come to it. Also, we'll add some more test cases while we're at it.

$tests = array(
  'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb'
);

$r4 = '/ ^ (?: a (?= a* (\1? b) ) )+ /x';
#          │     │      └─────┘ | │
#          │     │         1    | │
#          │     └──────────────┘ │
#          │         lookahead    │
#          └──────────────────────┘
#             non-capturing group

The output is now (as seen on ideone.com):

aaa 0
aaab 1 aaa|b        # (*gasp!*)
aaaxb 0
xaaab 0
b 0
abbb 1 a|b          # yes!
aabb 1 aa|bb        # YES!!
aaabbbbb 1 aaa|bbb  # YESS!!!
aaaaabbb 1 aaaaa|bb # NOOOOOoooooo....

A-ha! It looks like we're really close to the solution now! We managed to get group 1 to "count" using self-reference! But wait... something is wrong with the second and the last test cases!! There aren't enough bs, and somehow it counted wrong! We'll examine why this happened in the next step.

Lesson: One way to "initialize" a self-referencing group is to make the self-reference matching optional.


Step 4½: Understanding what went wrong

The problem is that since we made the self-reference matching optional, the "counter" can "reset" back to 0 when there aren't enough b's. Let's closely examine what happens at every iteration of our pattern with aaaaabbb as input.

 a a a a a b b b
↑
# Initial state: Group 1 is "uninitialized".
           _
 a a a a a b b b
  ↑
  # 1st iteration: Group 1 couldn't match \1 since it was "uninitialized",
  #                  so it matched and captured just b
           ___
 a a a a a b b b
    ↑
    # 2nd iteration: Group 1 matched \1b and captured bb
           _____
 a a a a a b b b
      ↑
      # 3rd iteration: Group 1 matched \1b and captured bbb
           _
 a a a a a b b b
        ↑
        # 4th iteration: Group 1 could still match \1, but not \1b,
        #  (!!!)           so it matched and captured just b
           ___
 a a a a a b b b
          ↑
          # 5th iteration: Group 1 matched \1b and captured bb
          #
          # No more a, + "loop" terminates

A-ha! On our 4th iteration, we could still match \1, but we couldn't match \1b! Since we allow the self-reference matching to be optional with \1?, the engine backtracks and took the "no thanks" option, which then allows us to match and capture just b!

Do note, however, that except on the very first iteration, you could always match just the self-reference \1. This is obvious, of course, since it's what we just captured on our previous iteration, and in our setup we can always match it again (e.g. if we captured bbb last time, we're guaranteed that there will still be bbb, but there may or may not be bbbb this time).

Lesson: Beware of backtracking. The regex engine will do as much backtracking as you allow until the given pattern matches. This may impact performance (i.e. catastrophic backtracking) and/or correctness.


Step 5: Self-possession to the rescue!

The "fix" should now be obvious: combine optional repetition with possessive quantifier. That is, instead of simply ?, use ?+ instead (remember that a repetition that is quantified as possessive does not backtrack, even if such "cooperation" may result in a match of the overall pattern).

In very informal terms, this is what ?+, ? and ?? says:

?+

  • (optional) "It doesn't have to be there,"
    • (possessive) "but if it is there, you must take it and not let go!"

?

  • (optional) "It doesn't have to be there,"
    • (greedy) "but if it is you can take it for now,"
      • (backtracking) "but you may be asked to let it go later!"

??

  • (optional) "It doesn't have to be there,"
    • (reluctant) "and even if it is you don't have to take it just yet,"
      • (backtracking) "but you may be asked to take it later!"

In our setup, \1 will not be there the very first time, but it will always be there any time after that, and we always want to match it then. Thus, \1?+ would accomplish exactly what we want.

$r5 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ /x';
#          │     │      └──────┘ │ │
#          │     │          1    │ │
#          │     └───────────────┘ │
#          │         lookahead     │
#          └───────────────────────┘
#             non-capturing group

Now the output is (as seen on ideone.com):

aaa 0
aaab 1 a|b          # Yay! Fixed!
aaaxb 0
xaaab 0
b 0
abbb 1 a|b
aabb 1 aa|bb
aaabbbbb 1 aaa|bbb
aaaaabbb 1 aaa|bbb  # Hurrahh!!!

Voilà!!! Problem solved!!! We are now counting properly, exactly the way we want it to!

Lesson: Learn the difference between greedy, reluctant, and possessive repetition. Optional-possessive can be a powerful combination.


Step 6: Finishing touches

So what we have right now is a pattern that matches a repeatedly, and for every a that was matched, there is a corresponding b captured in group 1. The + terminates when there are no more a, or if the assertion failed because there isn't a corresponding b for an a.

To finish the job, we simply need to append to our pattern \1 $. This is now a back reference to what group 1 matched, followed by the end of the line anchor. The anchor ensures that there aren't any extra b's in the string; in other words, that in fact we have anbn.

Here's the finalized pattern, with additional test cases, including one that's 10,000 characters long:

$tests = array(
  'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb',
  '', 'ab', 'abb', 'aab', 'aaaabb', 'aaabbb', 'bbbaaa', 'ababab', 'abc',
  str_repeat('a', 5000).str_repeat('b', 5000)
);

$r6 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ \1 $ /x';
#          │     │      └──────┘ │ │
#          │     │          1    │ │
#          │     └───────────────┘ │
#          │         lookahead     │
#          └───────────────────────┘
#             non-capturing group

It finds 4 matches: ab, aabb, aaabbb, and the a5000b5000. It takes only 0.06s to run on ideone.com.


Step 7: The Java test

So the pattern works in PHP, but the ultimate goal is to write a pattern that works in Java.

public static void main(String[] args) {

        String aNbN = "(?x) (?:  a  (?= a* (\\1?+ b))  )+ \\1";
        String[] tests = {
                "",      // false
                "ab",    // true
                "abb",   // false
                "aab",   // false
                "aabb",  // true
                "abab",  // false
                "abc",   // false
                repeat('a', 5000) + repeat('b', 4999), // false
                repeat('a', 5000) + repeat('b', 5000), // true
                repeat('a', 5000) + repeat('b', 5001), // false
        };
        for (String test : tests) {
                System.out.printf("[%s]%n  %s%n%n", test, test.matches(aNbN));
        }

}

static String repeat(char ch, int n) {
        return new String(new char[n]).replace('\0', ch);
}

The pattern works as expected (as seen on ideone.com).


And now we come to the conclusion...

It needs to be said that the a* in the lookahead, and indeed the "main + loop", both permit backtracking. Readers are encouraged to confirm why this is not a problem in terms of correctness, and why at the same time making both possessive would also work (though perhaps mixing mandatory and non-mandatory possessive quantifier in the same pattern may lead to misperceptions).

It should also be said that while it's neat that there's a regex pattern that will match anbn, this is in not always the "best" solution in practice. A much better solution is to simply match ^(a+)(b+)$, and then compare the length of the strings captured by groups 1 and 2 in the hosting programming language.

In PHP, it may look something like this (as seen in ideone.com):

function is_anbn($s) {
   return (preg_match('/^(a+)(b+)$/', $s, $groups)) &&
      (strlen($groups[1]) == strlen($groups[2]));
}

The purpose of this article is NOT to convince readers that regex can do almost anything; it clearly can't, and even for the things it can do, at least partial delegation to the hosting language should be considered if it leads to a simpler solution.

As mentioned at the top, while this article is necessarily tagged [regex] for stackoverflow, it is perhaps about more than that. While certainly there's value in learning about assertions, nested references, possessive quantifier, etc, perhaps the bigger lesson here is the creative process by which one can try to solve problems, the determination and hard work that it often requires when you're subjected to various constraints, the systematic composition from various parts to build a working solution, etc.


Bonus material! PCRE recursive pattern!

Since we did bring up PHP, it needs to be said that PCRE supports recursive pattern and subroutines. Thus, following pattern works for preg_match (as seen on ideone.com):

$rRecursive = '/ ^ (a (?1)? b) $ /x';

Currently Java's regex does not support recursive pattern.


Even more bonus material! Matching anbncn !!

So we've seen how to match anbn which is non-regular, but still context-free, but can we also match anbncn, which isn't even context-free?

The answer is, of course, YES! Readers are encouraged to try to solve this on their own, but the solution is provided below (with implementation in Java on ideone.com).

^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $

这篇关于我们怎样才能将^ n b ^ n与Java正则表达式匹配?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆