这个Java正则表达式如何检测回文? [英] How does this Java regex detect palindromes?

查看:126
本文介绍了这个Java正则表达式如何检测回文?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述


这是一系列教育正则表达式文章的第三部分。它遵循这个正则表达式如何找到三角形数字?(嵌套的地方)引用首先介绍)和我们如何匹配^ nb ^ n使用Java正则表达式?
(前瞻性的计数机制进一步详述)。这部分介绍了一种特定形式的嵌套断言,当与嵌套引用结合使用时,Java正则表达式可以匹配大多数人认为不可能的东西:回文!!

This is the third part in a series of educational regex articles. It follows How does this regex find triangular numbers? (where nested references is first introduced) and How can we match a^n b^n with Java regex? (where the lookahead "counting" mechanism is further elaborated upon). This part introduces a specific form of nested assertion, which when combined with nested references allows Java regex to match what most people believe is "impossible": palindromes!!

回文的语言是非定期;它实际上是无上下文(对于给定的字母表)。也就是说,现代正则表达式实现不仅仅识别常规语言,Perl / PCRE的递归模式和.NET的平衡组可以轻松识别回文(参见:相关问题)。

The language of palindromes is non-regular; it's actually context-free (for a given alphabet). That said, modern regex implementation recognizes more than just regular languages, and Perl/PCRE's recursive patterns and .NET's balancing groups can readily recognize palindromes (see: Related Questions).

但是,Java的正则表达式引擎既不支持这些高级功能。然而某人 * wink * 设法编写了以下正则表达式,似乎做得很好(另见ideone.com ):

However, Java's regex engine supports neither of these "advanced" features. And yet "someone" (*wink*) managed to write the following regex which seems to do the job just fine (see also on ideone.com):

public class Palindrome {
    // asserts that the entirety of the string matches the given pattern
    static String assertEntirety(String pattern) {
        return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
    }

    public static void main(String[] args) {
        final String PALINDROME =
            "(?x) | (?:(.) add)+ chk"
                .replace("add", assertEntirety(".*? (\\1 \\2?)"))
                .replace("chk", assertEntirety("\\2"));

        System.out.println(PALINDROME);
        // (?x) | (?:(.) (?<=(?=^.*? (\1 \2?)$).*))+ (?<=(?=^\2$).*)

        String[] tests = {
            "",     // true
            "x",    // true
            "xx",   // true
            "xy",   // false
            "xyx",  // true
            "xxx",  // true
            "xxyx", // false
            "racecar",                // true
            "step on no pets",        // true
            "aManaPlanaCanalPanaMa",  // true
            "this is impossible",     // FALSE!!!
        };
        for (String test : tests) {
            System.out.printf("[%s] %s%n", test, test.matches(PALINDROME));
        }
    }
}

所以这似乎有效,但是如何?

So this seems to work, but how?

  • java.util.regex.Pattern
  • regular-expressions.info/Freespacing (?x), Lookarounds (?=…)/(?<=…), etc.

COMMON SENSE ALERT !!!



这不是检测回文的最佳方法;它最多只有 O(N ^ 3)。使用更通用的编程语言执行此检测既更高效又更直接。

COMMON SENSE ALERT!!!

This is not the best way to detect palindromes; it's O(N^3)at best. Performing this detection in a more general purpose programming language is both more efficient and more straightforward.

您不希望使用正则表达式来检测回答与你不想使用正则表达式查找素数的原因相同的回文。也就是说,您将研究非递归非平衡组正则表达式如何检测回文,原因与您研究正则表达式如何用于素性测试的原因相同:这很有趣,很有挑战性,很有教育意义。

You wouldn't want to use regex to detect palindromes for the same reasons you wouldn't want to use regex to find prime numbers. That said, you would study how a non-recursive non-balancing group regex can detect palindromes for the same reasons you would study how a regex can be used for primality testing: it's fun, it's challenging, it's educational.



相关问题




  • 如何使用正则表达式检查字符串是否为回文? - 这是不可能的! (除非......)

  • 如何检查给定字符串是否为回文? - 多种语言的非正则表达式解决方案

  • 如何判断一个数字是否是正则表达式的素数?

  • Related questions

    • How to check that a string is a palindrome using regular expressions? - it's "impossible"! (unless...)
    • How to check if the given string is palindrome? - non-regex solutions in many languages
    • How to determine if a number is a prime with regex?
    • 推荐答案

      大图



      我们先来看看这个正则表达式从整体大图算法,然后仔细看看后面的具体实现细节。正则表达式几乎是以下Java代码的直接翻译:

      The Big Picture

      We will first look at this regex from the overall big picture algorithm, and then take a closer look at the specific implementation details later. The regex is an almost direct translation of the following Java code:

      static boolean isPalindrome(String s) {
         if (s.isEmpty()) {
            return true;
         }
         String g2 = null;
         for (char ch : s.toCharArray()) {
            String g1 = String.valueOf(ch);
            // "add"
            if (g2 != null && s.endsWith(g1 + g2)) {
               g2 = g1 + g2;
            } else if (s.endsWith(g1)) {
               g2 = g1;
            } else {
               break;
            }
         }
         return s.equals(g2); // "chk"
      }
      

      这显然不是最简单/最有效的Java代码检查回文,但它的工作原理,最令人着迷的是,几乎可以通过近似的一对一映射直接转换为正则表达式。这里再次使用正则表达式,为了方便起见,在这里重现,注释以突出显着的相似之处:

      This is obviously not the most straightforward/efficient Java code to check for palindromes, but it works, and most fascinatingly, it's almost directly translatable to regex with a near one-to-one mapping. Here's the regex again, reproduced here for convenience, annotated to highlight the striking resemblance:

      //  isEmpty  _for-loop_
      //       ↓  /          \
          "(?x) | (?:(.) add)+ chk"
      //             \_/  ↑
      //             g1   loop body                   ___g2___
      //                                             /        \
                 .replace("add", assertEntirety(".*? (\\1 \\2?)"))
                 .replace("chk", assertEntirety("\\2"));
                                 // s.equals(g2)
      

      附件 ideone.com上源代码的注释和扩展版本

      (现在可以随意忽略 assertEntirety 的细节:只需将其视为可以作出断言的黑盒正则表达式机制on 整个字符串,无论我们目前在哪里。)

      (Feel free to ignore the details of assertEntirety for now: just think of it as a black box regex mechanism that can make an assertion on the entire string regardless of where we currently are.)

      所以基本的算法是我们尝试建立一个后缀,取决于回文约束,因为我们从左到右扫描字符串。然后我们检查我们是否能够以这种方式构建完整的字符串。如果可以的话,那么这个字符串就是一个回文。此外,作为特殊情况,空字符串通常是回文。

      So the basic algorithm is that we try to build a suffix, subject to a palindromic constraint, as we scan the string left to right. We then check if we're able to build the complete string in this manner. If we can, then the string is a palindrome. Also, as a special case, the empty string is trivially a palindrome.

      一旦理解了大图片算法,我们就可以检查正则表达式模式如何实现它。

      Once the big picture algorithm is understood, we can examine how the regex pattern implements it.

      Java中的正则表达式模式最终只是字符串,这意味着它们可以通过字符串操作以任何字符串的方式派生。是的,我们甚至可以使用正则表达式来生成正则表达式模式 - 如果你愿意,可以采用一种元重复方法。

      Regex patterns in Java are ultimately nothing but strings, meaning they can be derived through string manipulations the way any string can. Yes, we can even use regex to generate a regex pattern -- a sort of meta-regexing approach, if you will.

      考虑这个初始化<$ c的例子$ c> int 常量(最终只包含一个数字):

      Consider this example of initializing an int constant (which ultimately contains nothing but a number):

      final int X = 604800;
      final int Y = 60 * 60 * 24 * 7;
      // now X == Y
      

      分配给的数字X 是一个字面整数:我们可以清楚地看到这个数字是多少。这不是使用表达式的 Y 的情况,但是这个公式似乎传达了这个数字所代表的想法。即使没有正确命名这些常量,我们仍然认为 Y 可能代表一周内的秒数,即使我们可能不会立即知道数值是多少。另一方面,使用 X 我们确切地知道这个数字,但我们对它所代表的内容了解不多。

      The number assigned to X is a literal integer: we can clearly see what the number is. This is not the case with Y which uses an expression instead, and yet this formula seems to convey an idea of what this number represents. Even without proper naming of these constants, we nonetheless get the idea that Y probably represents the number of seconds in a week, even if we may not immediately know what the numeric value is. On the other hand, with X we know precisely that number, but we get less of an idea of what it represents.

      在代码段中使用字符串替换是类似的情况,但对于字符串正则表达式模式。而不是明确地将模式写为一个文字字符串,有时从更简单的部分中获得该值的系统和逻辑推导(公式)可能更有意义。对于正则表达式来说尤其如此,我们理解模式的作用往往比能够看到它看起来像字符串文字的东西更重要(无论如何都不是很好看,所有额外的反斜杠都是什么) 。

      The use of string replacements in the snippet is an analogous situation but for strings regex patterns. Instead of explicitly writing the pattern as one literal string, sometimes systematic and logical derivation (the "formula") of that value from simpler parts can be much more meaningful. This is especially true with regex, where it often matters more that we understand what the pattern does than being able to see what it looks like as a string literal (which isn't much of a looker anyway, what with all the extra backslashes).

      为方便起见,此处再次转录部分片段:

      A portion of the snippet is reproduced here again for convenience:

      // the "formula"
           final String PALINDROME =
              "(?x) | (?:(.) add)+ chk"
                 .replace("add", assertEntirety(".*? (\\1 \\2?)"))
                 .replace("chk", assertEntirety("\\2"));
      
      // the "value"
           System.out.println(PALINDROME);
           //                       ____add_____             chk_
           //               _______/            \____   _______/ \_____
           // (?x) | (?:(.) (?<=(?=^.*? (\1 \2?)$).*))+ (?<=(?=^\2$).*)
           //        |  \_/             \______/     |
           //        |   1                 2         |
           //        |_______________________________|
      

      毫无疑问,公式比最终的字符串值更具可读性案例。

      Without a doubt the "formula" is a lot more readable than the eventual string "value" in this case.

      当然有更复杂的方法可以以编程方式生成正则表达式模式,并且当然可以用这样的方式编写,即混淆而不是强调其含义,但是谨慎使用即使是简单的字符串替换仍然可以令人惊讶(正如希望在此示例中所示)。

      There are certainly much more sophisticated ways to programmatically generate a regex pattern, and it's certainly possible to write in such a way that obfuscates instead of accentuates its meaning, but mindful usage of even simple string replacements can still do wonder (as hopefully shown in this example).

      课程 :考虑以编程方式生成正则表达式模式。

      Lesson: Consider programmatic generation of regex patterns.

      (?:(。)add)+ 构造,其中添加是一个做某种计数的断言,已经在前两部分中进行了彻底的讨论。但值得注意的是两个功能:

      The (?:(.) add)+ construct, where add is an assertion that does some sort of "counting", has already been thoroughly discussed in the previous two parts. Two features are worth noting, though:


      • (。)捕获到组中1,允许后面的反向引用

      • 断言是 assertEntirety 而不只是向前看我们当前的位置


        • 我们稍后会详细讨论这个问题;只是把它想象成一种断言整个字符串的方法

        • The (.) captures into group 1, allowing backreference later on
        • The assertion is assertEntirety instead of just looking ahead from our current position
          • We'll discuss this in more detail later; just think of it as a way to assert on the entire string

          模式应用于 assertEntirety 添加中如下:

          # prefix   _suffix_
          #    ↓    /        \
              .*?   ( \1 \2? )
          #         \________/   i.e. a reluctant "whatever" prefix (as short as possible)
          #          group 2          followed by a suffix captured into group 2
          

          请注意,第2组是自引用的,带有可选的说明符,这是本系列第2部分中已讨论过的技术。毋庸置疑,第2组是我们在这种模式中的反击:它是一个后缀,我们将尝试在循环的每次迭代中向左增长。当我们从左到右迭代每个(。)时,我们尝试在前面添加相同的字符(使用反向引用到 \1 )到我们的后缀。

          Note that group 2 is self-referencing with an optional specifier, a technique already discussed in part 2 of the series. Needless to say group 2 is our "counter" in this pattern: it's a suffix that we will try to grow leftward on every iteration of the "loop". As we iterate on each (.) left to right, we try to prepend that same character (using backreference to \1) to our suffix.

          再次回忆上述模式的Java代码翻译,为方便起见,这里转载:

          Recall again the Java code translation of the above pattern, reproduced here for convenience:

          if (g2 != null && s.endsWith(g1 + g2)) {   // \2? is greedy, we try this first
             g2 = g1 + g2;
          } else if (s.endsWith(g1)) {    // since \2? is optional, we may also try this
             g2 = g1;
          } else {        // if there's no matching suffix, we "break" out of the "loop"
             break;
          }
          

          \2?是可选的意味着一些事情:

          The fact that \2? is optional means a few things:


          • 它为自我引用提供了基本案例(我们的主要原因)这样做!)

          • 由于 \2?是后缀模式的一部分(因此在整个模式的后面出现),前缀部分必须是不情愿的,因此。*?而不是。* 。这允许 \2?行使其贪婪。

          • 计数器也可能重置并给出错误结果


            • 在第2部分中,我们展示了如何回溯可能导致同样的问题重置


              • 我们使用所有格量词?+ 解决了这个问题,但这不适用于此

              • It provides a "base case" for the self-reference (the main reason we do this!)
              • Since \2? is part of the suffix pattern (and thus appears later in the overall pattern), the prefix part must be reluctant, hence .*? instead of .*. This allows \2? to exercise its greediness.
              • The "counter" may also "reset" and give the "wrong" result
                • In part 2 we showed how backtracking ? may result in the same kind of problematic resetting
                  • We solved the problem by using possessive quantifier ?+, but this is not applicable here

                  第三点进一步阐述在下一部分中。

                  The third point is elaborated further in the next section.

                  课程 :仔细分析模式部分中贪婪/不情愿重复之间的相互作用。

                  Lesson: Carefully analyze the interactions between greedy/reluctant repetitions in parts of a pattern.

                  • Difference between .*? and .* for regex
                  • Regular expression: who's greedier?

                  如上一节所述,可选和可追溯的 \ 2?表示在某些情况下我们的后缀会缩小。我们将逐步检查此输入的这种情况:

                  As alluded in the previous section, the optional and backtrackable \2? means that our suffix can shrink under some circumstances. We will examine such a scenario step by step for this input:

                   x y x y z y x
                  ↑
                  # Initial state, \2 is "uninitialized"
                               _
                  (x)y x y z y x
                    ↑
                    # \1 captured x, \2 couldn't match \1\2 (since \2 is "uninitialized")
                    #                but it could match \1 so it captured x
                             ___
                   x(y)x y z y x
                      ↑
                      # \1 captured y, \2 matched \1\2 and grew to capture yx
                               _
                   x y(x)y z y x
                        ↑
                        # \1 captured x, \2 couldn't match \1\2,
                        #                but it could match \1 so it shrunk to capture x (!!!)
                             ___
                   x y x(y)z y x
                          ↑
                          # \1 captured y, \2 matched \1\2 and grew to capture yx
                           _____
                   x y x y(z)y x
                            ↑
                            # \1 captured z, \2 matched \1\2 and grew to capture zyx
                         _______
                   x y x y z(y)x
                              ↑
                              # \1 captured y, \2 matched \1\2 and grew to capture yzyx
                       _________
                   x y x y z y(x)
                                ↑
                                # \1 captured x, \2 matched \1\2 and grew to capture xyzyx
                  

                  我们可以修改我们的模式(和相应的Java代码)以省略 chk 阶段,看看确实发生了这种情况:

                  We can modify our pattern (and the corresponding Java code) to omit the chk phase, and see that indeed this is what happens:

                      // modified pattern without a chk phase; yields false positives!
                      final String PALINDROME_BROKEN =
                          "(?x) | (?:(.) add)+"
                              .replace("add", assertEntirety(".*? (\\1 \\2?)"));
                  
                      String s = "xyxyzyx"; // NOT a palindrome!!!
                  
                      Matcher m = Pattern.compile(PALINDROME_BROKEN).matcher(s);
                      if (m.matches()) {
                          System.out.println(m.group(2)); // prints "xyzyx"
                      }
                  

                  如上所述, xyxyzyx NOT 回文,被错误地报告为一个,因为我们没有检查增长的后缀是否最终成为完整的字符串(它显然没有在这种情况下)。 chk 阶段( assertEntirety 模式 \ 2 )在我们的设置中是绝对必要的。我们需要确认我们确实设法一直增加后缀。如果是这种情况,那么我们自己就是一个回文。

                  As explained, "xyxyzyx", which is NOT a palindrome, is falsely reported as one, because we didn't check if the growing suffix eventually became the complete string (which it clearly did not in this case). The chk phase (which is an assertEntirety of the pattern \2) is therefore an absolute necessity in our setup. We need to confirm that indeed we managed to grow our suffix all the way. If this is the case, then we have ourselves a palindrome.

                  课程 :仔细分析任何可能无意识的一面 - 可选自引用匹配的影响。

                  Lesson: Carefully analyze any possibly unintended side-effects of optional self-reference matching.

                  虽然我们可以写一个Java正则表达式来检测回文,但是除了 assertEntirety 之外的一切都很好已经在本系列的前几部分中介绍过。这里唯一新的东西是这个神秘的黑盒子,这个强大的机制让我们神奇地允许我们做不可能的事情。

                  While it's neat that we can write a Java regex pattern to detect palindromes, everything here except assertEntirety has already been covered in the previous parts of the series. The only new thing here is this mysterious black box, this powerful mechanism that magically allowed us to do what is otherwise "impossible".

                  assertEntirety construct基于以下嵌套外观的元模式:

                  The assertEntirety construct is based on the following meta-pattern of nested lookarounds:


                  (? < =(?= ^ pattern $)。*)


                  我能看到一个放在我身后的某个地方我可以向前看,看 ^ pattern $


                  名称lookaround意味着与我们当前位置的相对性:我们正在寻找围绕我们,可能在我们之前或之后站着。通过以这种方式在后视镜中嵌套前瞻,我们能够隐喻地飞向天空并查看整个画面。

                  The name "lookaround" implies the relativity to our current position: we're looking around us, perhaps in front of or behind, from where we're standing. By nesting a lookahead in a lookbehind in this manner, we're able to metaphorically "fly into the skies" and look at the entire picture.

                  抽象这个元模式进入 assertEntirety 有点像编写预处理替换宏。在任何地方嵌套的外观都可能会损害可读性和可维护性,因此我们将其封装到 assertEntirety 中,这不仅隐藏了其内部工作的复杂性,而且还通过赋予它来进一步强调其语义。一个合适的名称。

                  Abstracting this meta-pattern into assertEntirety is a bit like writing preprocessing substitution macros. Having nested lookarounds everywhere probably hurts readability and maintainability, so we encapsulate it into assertEntirety, which not only hides the complexity of its inner workings, but also further emphasizes its semantics by giving it an appropriate name.

                  课程 :考虑抽象元模式以隐藏复杂性并传达语义。

                  Lesson: Consider abstracting meta-patterns to hide complexity and convey semantics.

                  敏锐的读者会请注意 assertEntirety 在lookbehind中包含。* ,这使得它的理论最大长度无限。不,Java并没有正式支持无限长的外观。是的,因为它已在这里充分展示,无论如何它仍然有效。正式它被归类为虫子;但是某人(* wink *)也可以认为它是一个隐藏的功能。

                  Observant readers would notice that assertEntirety contains a .* in a lookbehind, which makes its theoretical maximum length infinite. No, Java does not officially support infinite-length lookbehind. Yes, as it has been adequatedly demonstrated here, it works anyway. Officially it's categorized as a "bug"; but "someone"(*wink*) could also consider it a "hidden feature".

                  这个bug肯定有可能将来固定。删除这个隐藏的功能将打破Java regex回文问题的特定解决方案。

                  It's certainly possible that this "bug" will be "fixed" in the future. Removal of this hidden feature will break this particular solution to the Java regex palindrome problem.

                  • Regex look-behind without obvious maximum length in Java

                  这篇关于这个Java正则表达式如何检测回文?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆