Split string of varying length using regex


Question

I don't know if this is possible using regex. I'm just asking in case someone knows the answer.

I have a string, "hellohowareyou??". I need to split it like this:

[h,el,loh,owar,eyou?,?]

The splitting is done such that the first string has length 1, the second has length 2, and so on. The last string holds the remaining characters. I can do it easily without regex using a function like this:

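// Requires: import java.util.ArrayList;
public ArrayList<String> splitString(String s) {
    int cnt = 0, i;
    ArrayList<String> sList = new ArrayList<String>();
    // cnt is the length of the piece just added; keep going while a piece
    // of length cnt + 1 still fits inside the string.
    for (i = 0; i + cnt < s.length(); i = i + cnt) {
        cnt++;
        sList.add(s.substring(i, i + cnt));
    }
    // Whatever is left becomes the final piece (empty if the length is a triangular number).
    sList.add(s.substring(i, s.length()));
    return sList;
}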

I was just curious whether such a thing can be done using regex.

Recommended answer

Solution

The following snippet generates a pattern that does the job (see it run on ideone.com):

// splits at indices that are triangular numbers
class TriangularSplitter {

  // asserts that the prefix of the string matches pattern
  static String assertPrefix(String pattern) {
    return "(?<=(?=^pattern).*)".replace("pattern", pattern);
  }
  // asserts that the entirety of the string matches pattern
  static String assertEntirety(String pattern) {
    return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
  }
  // repeats an assertion as many times as there are dots behind current position
  static String forEachDotBehind(String assertion) {
    return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
  }

  public static void main(String[] args) {
    final String TRIANGULAR_SPLITTER =
      "(?x) (?<=^.) | measure (?=(.*)) check"
        .replace("measure", assertPrefix("(?: notGyet . +NBefore +1After)*"))
        .replace("notGyet", assertPrefix("(?! \\1 \\G)"))
        .replace("+NBefore", forEachDotBehind(assertPrefix("(\\1? .)")))
        .replace("+1After", assertPrefix(".* \\G (\\2?+ .)"))
        .replace("check", assertEntirety("\\1 \\G \\2 . \\3"))
        ;
    String text = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    System.out.println(
        java.util.Arrays.toString(text.split(TRIANGULAR_SPLITTER))
    );
    // [a, bc, def, ghij, klmno, pqrstu, vwxyzAB, CDEFGHIJ, KLMNOPQRS, TUVWXYZ]
  }
}

Note that this solution uses techniques already covered in my regex article series. The only new things here are \G and forward references.

This is a brief description of the basic regex constructs used:


  • (?x) is the embedded flag modifier that enables free-spacing mode, where unescaped whitespace is ignored (and # can be used for comments).
  • ^ and $ are the beginning-of-line and end-of-line anchors. \G is the end-of-previous-match anchor.
  • | denotes alternation (i.e. "or").
  • ? as a repetition specifier denotes optional (i.e. zero-or-one of). Placed after another quantifier, e.g. in .*?, it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
  • (…) is used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows, among other things, matching on back/forward/nested references (e.g. \1).
  • (?=…) is a positive lookahead; it looks to the right to assert that there's a match of the given pattern. (?<=…) is a positive lookbehind; it looks to the left.
  • (?!…) is a negative lookahead; it looks to the right to assert that there isn't a match of a pattern.
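Two of these constructs, lookbehind and \G, are easier to follow in a much smaller setting than the triangular pattern. The sketch below is not part of the original answer; the class and method names are made up for illustration. It splits on zero-width positions only, which is the same mechanism TriangularSplitter relies on.

```java
import java.util.Arrays;

class ZeroWidthSplitDemo {
    // Splits right after every '-' without consuming it:
    // the lookbehind matches an empty position, so no character is removed.
    static String[] splitAfterDash(String text) {
        return text.split("(?<=-)");
    }

    // Splits into chunks of two characters: a position matches when the
    // previous match ended (\G) exactly two characters to its left.
    static String[] splitPairs(String text) {
        return text.split("(?<=\\G..)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAfterDash("a-b-c")));  // [a-, b-, c]
        System.out.println(Arrays.toString(splitPairs("abcdefg")));    // [ab, cd, ef, g]
    }
}
```

splitPairs needs only a fixed distance from \G. The triangular split is harder because the distance from \G grows by one with every match, so the pattern has to work that distance out each time.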

  • Articles in the [nested-reference] series:
    • How does this regex find triangular numbers?
    • How can we match a^n b^n with Java regex?
    • How does this Java regex detect palindromes?

The pattern matches on zero-width assertions. A rather complex algorithm is used to assert that the current position is a triangular number. There are 2 main alternatives:


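  • (?<=^.), i.e. we can look behind and see the beginning of the string one dot away
    • This matches at index 1, and is a crucial starting point for the rest of the process
  • Otherwise, measure (?=(.*)) check, i.e. we measure how the last match was made, using \G as the reference point, and then check whether the current position is where the next match belongs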
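To make the target concrete, the positions the pattern has to match are the triangular numbers that fall strictly inside the string. The helper below is a hypothetical illustration, not part of the original answer; it lists them with ordinary arithmetic.

```java
import java.util.ArrayList;
import java.util.List;

class TriangularIndices {
    // Returns every index i with 0 < i < length such that i = 1 + 2 + ... + k for some k.
    static List<Integer> splitPoints(int length) {
        List<Integer> points = new ArrayList<>();
        int index = 0;
        for (int k = 1; index + k < length; k++) {
            index += k;        // index is now the k-th triangular number
            points.add(index);
        }
        return points;
    }

    public static void main(String[] args) {
        // 52 is the length of the test string used in TriangularSplitter.
        System.out.println(splitPoints(52));  // [1, 3, 6, 10, 15, 21, 28, 36, 45]
    }
}
```

Nine split points give the ten pieces shown in the expected output of TriangularSplitter.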

Thus the first alternative is the trivial "base case", and the second alternative sets up how to make all subsequent matches after that. Named groups are not used here (they only arrived in Java 7), so here are the semantics of the 3 capturing groups:


  • \1 captures the string "before" \G
  • \2 captures some string "after" \G
  • If the length of \1 is e.g. 1+2+3+...+k, then the length of \2 needs to be k.
    • Hence \2 . has length k+1 and should be the next part in our split!
  • \3, captured by (?=(.*)), holds the rest of the string to the right of the current position
    • Hence when we can assertEntirety on \1 \G \2 . \3, we match and set the new \G

You can use mathematical induction to rigorously prove the correctness of this algorithm.
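A sketch of the inductive step, using notation introduced here rather than taken from the original answer: write T_k for the k-th triangular number and g for the position of \G. The base case is the first alternative, which matches at index 1 = T_1.

```latex
T_k = \sum_{i=1}^{k} i = \frac{k(k+1)}{2},
\qquad
g = T_k \;\Longrightarrow\;
g + (k+1) = \frac{k(k+1)}{2} + (k+1) = \frac{(k+1)(k+2)}{2} = T_{k+1}.
```

So if the previous match ended at a triangular index and \2 is measured to have length k, matching k+1 positions after \G lands on the next triangular index.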

To help illustrate how this works, let's work through an example. Let's take abcdefghijklmn as input, and say that we've already split off [a, bc, def].

          \G     we now need to match here!
           ↓       ↓
a b c d e f g h i j k l m n
\____1____/ \_2_/ . \__3__/   <--- \1 \G \2 . \3
  L=1+2+3    L=3

Remember that \G marks the end of the last match, and it occurs at triangular number indices. If \G occurred at 1+2+3+...+k, then the next match needs to be k+1 positions after \G to be a triangular number index.

Thus in our example, given that \G is where we just split off def, we measure k=3, and the next match will split off ghij as expected.

To have \1 and \2 built according to the above specification, we basically do a while "loop": for as long as it's notGyet, we count up to k as follows:


  • +NBefore, i.e. we extend \1 by one forEachDotBehind
  • +1After, i.e. we extend \2 by just one

Note that notGyet contains a forward reference to group 1, which is defined later in the pattern. Essentially we do the loop until \1 "hits" \G.
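The measuring loop can be mirrored in plain Java, with integers standing in for the lengths of \1 and \2. This is only an imperative analogue under that assumption, not how the regex engine executes the pattern, and the class and method names are hypothetical.

```java
class MeasureLoopSketch {
    // Given g, the index where the previous match ended (assumed to be a
    // triangular number 1 + 2 + ... + k), returns the index of the next match.
    static int nextMatchIndex(int g) {
        int before = 0; // stands in for the length of \1
        int after = 0;  // stands in for the length of \2
        int dots = 0;   // dots consumed so far by the loop
        while (before < g) {   // notGyet: \1 has not reached \G
            dots++;            // consume one more dot
            before += dots;    // +NBefore: grow \1 once per dot behind
            after += 1;        // +1After: grow \2 by exactly one
        }
        if (before != g) {
            throw new IllegalArgumentException(g + " is not a triangular number");
        }
        return g + after + 1;  // check: \1 \G \2 . \3 puts us k + 1 past \G
    }

    public static void main(String[] args) {
        String text = "abcdefghijklmn";
        int g = 6;                         // we have just split off [a, bc, def]
        int next = nextMatchIndex(g);
        System.out.println(next + " " + text.substring(g, next));  // 10 ghij
    }
}
```

With integers each step is constant time. The regex has to do the same counting by re-matching strings, which is where the cost discussed next comes from.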

Needless to say, this particular solution has terrible performance. The regex engine only remembers WHERE the last match was made (with \G), and forgets HOW (i.e. all capturing groups are reset when the next attempt to match is made). Our pattern must then reconstruct the HOW (an unnecessary step in traditional solutions, where variables aren't so "forgetful") by painstakingly building strings one appended character at a time (which is O(N^2)). Each simple measurement is linear instead of constant time (since it's done as string matching, where length is a factor), and on top of that we make many redundant measurements (i.e. to extend by one, we first need to re-match what we already have).

There are probably many "better" regex solutions than this one. Nonetheless, the complexity and inefficiency of this particular solution should rightfully suggest that regex is not designed for this kind of pattern matching.

That said, for learning purposes, this is an absolutely wonderful problem, for there is a wealth of knowledge in researching and formulating its solutions. Hopefully this particular solution and its explanation have been instructive.
