DFA based regular expression matching - how to get all matches?




    I have a given DFA that represents a regular expression. I want to match the DFA against an input stream and get all possible matches back, not only the leftmost-longest match.

    For example:

    regex: a*ba|baa

    input: aaaaabaaababbabbbaa

    result:

    1. aaaaaba
    2. aaba
    3. ba
    4. baa
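
    (For reference: here is how a conventional backtracking engine scans this input, shown with Python's re purely for illustration, not the DFA approach asked about. Perl-style alternation prefers the first branch that succeeds, so it reports ba for the final match where a POSIX leftmost-longest scan reports baa.)

    import re

    pattern = re.compile(r"a*ba|baa")
    text = "aaaaabaaababbabbbaa"

    # Non-overlapping, left-to-right scan with first-alternative preference.
    print([m.group() for m in pattern.finditer(text)])
    # -> ['aaaaaba', 'aaba', 'ba', 'ba']
    # A POSIX leftmost-longest (DFA-style) scan would report 'baa' as the
    # final match instead of 'ba'.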

    Solution

    Assumptions

    Based on your question and later comments, you want a general method for splitting a sentence into non-overlapping, matching substrings, with non-matching parts of the sentence discarded. You also seem to want optimal run-time performance. I also assume you already have an algorithm to transform a regular expression into DFA form. I further assume that you are doing this by the usual method of first constructing an NFA and converting it by subset construction to a DFA, since I'm not aware of any other way of accomplishing this.

    Before you go chasing after shadows, make sure you're applying the right tool for the job. Any discussion of regular expressions is almost always muddied by the fact that folks use regular expressions for a lot more things than they are really optimal for. If you want to receive the benefits of regular expressions, be sure you're using a regular expression, and not something broader. If what you want to do can't be coded into a regular expression itself, then you can't benefit (fully) from the advantages of regular expression algorithms.

    An obvious example is that no amount of cleverness will allow an FSM, or any algorithm, to predict the future. For instance, an expression like (a*b)|(a), when matched against the string aaa... (where the ellipsis is the portion of the input not yet scanned, because the user has not typed it yet), cannot give you every possible right subgroup.

    For a more detailed discussion of regular expression implementations, and specifically Thompson NFAs, see Russ Cox's article "Regular Expression Matching Can Be Simple And Fast" (https://swtch.com/~rs/regexp/regexp1.html), which describes a simple C implementation with some clever optimizations.

    Limitations of Regular Languages

    The O(n) time and O(1) space guarantees of regular expression algorithms are a fairly narrow claim. Specifically, the regular languages are exactly the set of languages that can be recognized in constant space. This distinction is important. Any kind of enhancement to the algorithm that does something more sophisticated than accepting or rejecting a sentence is likely to operate on a larger set of languages than the regular ones. On top of that, if you can show that some enhancement requires greater than constant space to implement, then you are also outside of the performance guarantee. That being said, we can still do an awful lot if we are very careful to keep our algorithm within these narrow constraints.

    Obviously that eliminates anything we might want to do with recursive backtracking. A stack does not have constant space. Even maintaining pointers into the sentence would be verboten, since we don't know how long the sentence might be. A long enough sentence would overflow any integer pointer. We can't create new states for the automaton as we go to get around this. All possible states (and a few impossible ones) must be predictable before exposing the recognizer to any input, and that quantity must be bounded by some constant, which may vary for the specific language we want to match, but by no other variable.

    This still allows some room for adding additional behavior. The usual way of getting more mileage is to add some extra annotations for where certain events in processing occur, such as when a subexpression started or stopped matching. Since we are only allowed constant-space processing, that limits the number of subexpression matches we can track. This usually means the latest instance of that subexpression. This is why, when you ask for the subgroup matched by (a|)*, you always get an empty string, because any sequence of a's is implicitly followed by infinitely many empty strings.
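
    You can see this behavior in a conventional engine (Python's re shown purely for illustration; its repeated groups likewise keep only the latest iteration):

    import re

    # After the three a's, the group's final iteration matches the empty
    # string, and only that latest submatch annotation is kept.
    print(repr(re.match(r"(a|)*", "aaa").group(1)))   # ''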

    The other common enhancement is to do some clever thing between states. For example, in perl regex, \b matches the empty string, but only if the previous character is a word character and the next is not, or vice versa. Many simple assertions fit this, including the common line anchor operators, ^ and $. Lookahead and lookbehind assertions are also possible, but much more difficult.

    When discussing the differences between various regular language recognizers, it's worth clarifying whether we're talking about match recognition or search recognition, the former accepting only if the entire sentence is in the language, and the latter accepting if any substring of the sentence is in the language. These are equivalent in the sense that if some expression E is accepted by the search method, then .*(E).* is accepted by the match method.

    This is important because we might want to make it clear whether an expression like a*b|a accepts aa or not. In the search method, it does: either token will match the right side of the disjunction. Under match semantics, though, it doesn't, because you could never get that sentence by stepping through the expression and generating tokens from the transitions, at least not in a single pass. For this reason, I'm only going to talk about match semantics. Obviously, if you want search semantics, you can modify the expression with .*'s.
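
    A minimal illustration of the two semantics and the equivalence above, using Python's re (fullmatch for match semantics, search for search semantics):

    import re

    E = r"a*b|a"

    # Search semantics: accept if any substring is in the language.
    print(bool(re.search(E, "aa")))                  # True: each 'a' matches the right side
    # Match semantics: accept only if the whole sentence is in the language.
    print(bool(re.fullmatch(E, "aa")))               # False
    # The equivalence: searching for E is matching .*(E).*
    print(bool(re.fullmatch(rf".*({E}).*", "aa")))   # True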

    Note: a language defined by the expression E|.* is not really a very manageable language, regardless of the sublanguage of E, because it matches all possible sentences. This represents a real challenge for regular expression recognizers, because they are really only suited to recognizing a language, or else confirming that a sentence is not in that same language, rather than doing any more specific work.

    Implementation of Regular Language Recognizers

    There are generally three ways to process a regular expression. All three start the same, by transforming the expression into an NFA. This process produces one or two states for each production rule in the original expression. The rules are extremely simple. Here's some crude ASCII art: note that a is any single literal character in the language's alphabet, and E1 and E2 are any regular expressions. Epsilon (ε) is a state with inputs and outputs, but it ignores the stream of characters and doesn't consume any input.

    a     ::=  > a ->
    
    E1 E2 ::=   >- E1 ->- E2 ->
    
                   /---->
    E1*   ::= > --ε <-\
                   \  /
                    E1
    
                 /-E1 ->
    E1|E2 ::= > ε
                 \-E2 ->
    

    And that's it! Common uses such as E+, E?, [abc] are equivalent to EE*, (E|), (a|b|c) respectively. Also note that we add for each production rule a very small number of new states. In fact, each rule adds zero or one state (in this presentation). Characters, quantifiers and disjunction all add just one state, and concatenation doesn't add any. Everything else is done by updating the fragments' end pointers to the start pointers of other states or fragments.
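
    Here is a hedged sketch of that construction in Python (my own illustrative code, not a known library): each helper implements one production rule from the diagrams above, returning a fragment as (start state, list of dangling out-arrows to patch later). Notice that each rule allocates at most one new state, matching the counting argument above.

    class State:
        def __init__(self, char=None):
            self.char = char   # a literal character, or None for an ε/split state
            self.out = None    # primary transition
            self.out2 = None   # secondary transition (splits only)

    MATCH = State()            # the unique accepting state

    def patch(dangling, target):
        for state, attr in dangling:
            setattr(state, attr, target)

    def lit(c):                # a     ::= > a ->
        s = State(c)
        return s, [(s, 'out')]

    def cat(f1, f2):           # E1 E2 : patch E1's loose ends into E2's start
        start1, d1 = f1
        start2, d2 = f2
        patch(d1, start2)
        return start1, d2

    def alt(f1, f2):           # E1|E2 : one ε-split fanning out to both
        start1, d1 = f1
        start2, d2 = f2
        split = State()
        split.out, split.out2 = start1, start2
        return split, d1 + d2

    def star(f):               # E1*   : ε-split looping back through E1
        start1, d1 = f
        split = State()
        split.out = start1
        patch(d1, split)       # E1's ends loop back into the split
        return split, [(split, 'out2')]

    def compile_nfa(fragment):
        start, dangling = fragment
        patch(dangling, MATCH)
        return start

    # a*ba|baa, built directly from the production rules
    nfa = compile_nfa(alt(
        cat(cat(star(lit('a')), lit('b')), lit('a')),
        cat(cat(lit('b'), lit('a')), lit('a'))))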

    The epsilon transition states are important, because they are ambiguous. When one is encountered, is the machine supposed to change state to one following state or the other? Should it change state at all or stay put? That's the reason why these automata are called nondeterministic. The solution is to have the automaton transition to the right state, whichever allows it to match the best. Thus the tricky part is figuring out how to do that.

    There are fundamentally two ways of doing this. The first way is to try each one. Follow the first choice, and if that doesn't work, try the next. This is recursive backtracking, which appears in a few (and notable) implementations. For well-crafted regular expressions, this implementation does very little extra work. If the expression is a bit more convoluted, recursive backtracking is very, very bad, O(2^n).

    The other way of doing this is to instead try both options in parallel. At each epsilon transition, add to the set of current states both of the states the epsilon transition suggests. Since you are using a set, the same state can come up more than once, but you only need to track it once: either you are in that state or you are not. If you get to the point where there's no option for a particular state to follow, just ignore it; that path didn't match. If there are no more states, then the entire expression didn't match. As soon as any state reaches the final state, you are done.
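
    Continuing the sketch from above (the same State, MATCH and nfa are assumed), the set-based simulation looks like this: follow every ε-split eagerly, keep each reachable state exactly once, and never backtrack.

    def add_state(s, states):
        """Add s plus everything reachable from s by ε-transitions."""
        if s is None or s in states:
            return
        states.add(s)
        if s.char is None and s is not MATCH:   # ε-split: take both arrows
            add_state(s.out, states)
            add_state(s.out2, states)

    def matches(nfa_start, text):
        current = set()
        add_state(nfa_start, current)
        for c in text:
            nxt = set()
            for s in current:
                if s.char == c:                 # a literal edge consumes c
                    add_state(s.out, nxt)
            current = nxt                       # never more than m live states
            if not current:                     # every path died: no match
                return False
        return MATCH in current                 # accept iff a pointer ended here

    print(matches(nfa, "aaaaaba"))   # True
    print(matches(nfa, "baa"))       # True
    print(matches(nfa, "ab"))        # False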

    Just from that explanation, the amount of work we have to do has gone up a little bit. We've gone from keeping track of a single state to several. At each iteration, we may have to update on the order of m state pointers, including things like checking for duplicates. Also, the amount of storage we need has gone up, since it's no longer a single pointer to one possible state in the NFA, but a whole set of them.

    However, this isn't anywhere close to as bad as it sounds. First off, the number of states is bounded by the number of productions in the original regular expression. From now on we'll call this value m, to distinguish it from the number of symbols in the input, which will be n. If two state pointers end up transitioning to the same new state, you can discard one of them, because no matter what else happens, they will both follow the same path from there on out. This means the number of state pointers you need is bounded by the number of states, so that, too, is m.

    This is a bigger win in the worst case scenario when compared to backtracking. After each character is consumed from the input, you will create, rename, or destroy at most m state pointers. There is no way to craft a regular expression which will cause you to execute more than that many instructions (times some constant factor depending on your exact implementation), or will cause you to allocate more space on the stack or heap.

    This NFA, simultaneously in some subset of its m states, may be considered some other state machine whose state represents the set of states the NFA it models could be in. Each state of that FSM represents one element from the power set of the states of the NFA. This is exactly the DFA implementation used for matching regular expressions.

    Using this alternate representation has the advantage that instead of updating m state pointers, you only have to update one. It also has a downside: since it models the power set of m states, it actually has up to 2^m states. That is an upper limit, because you don't model states that cannot happen; for instance, the expression a|b has two possible states after reading the first character, either the one for having seen an a, or the one for having seen a b. No matter what input you give it, it cannot be in both of those states at the same time, so that state-set does not appear in the DFA. In fact, because you are eliminating the redundancy of epsilon transitions, many simple DFAs actually get SMALLER than the NFA they represent, but there is simply no way to guarantee that.

    To keep the explosion of states from growing too large, a solution used in a few versions of that algorithm is to only generate the DFA states you actually need, and if you get too many, discard ones you haven't used recently. You can always generate them again.
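
    A hedged sketch of that lazy variant, built on the NFA simulation above (add_state, MATCH and nfa as before): each DFA state is a frozenset of NFA states, and each transition is computed only the first time the input demands it, then cached. Eviction of stale cache entries is omitted here.

    dfa_cache = {}   # (frozenset of NFA states, char) -> frozenset of NFA states

    def start_state(nfa_start):
        states = set()
        add_state(nfa_start, states)
        return frozenset(states)

    def step(dfa_state, c):
        key = (dfa_state, c)
        if key not in dfa_cache:         # build this DFA transition on demand
            nxt = set()
            for s in dfa_state:
                if s.char == c:
                    add_state(s.out, nxt)
            dfa_cache[key] = frozenset(nxt)
        return dfa_cache[key]

    def dfa_matches(nfa_start, text):
        d = start_state(nfa_start)
        for c in text:
            d = step(d, c)
            if not d:                    # the dead state: reject early
                return False
        return MATCH in d

    print(dfa_matches(nfa, "aaba"))   # True
    print(dfa_matches(nfa, "bab"))    # False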

    From Theory to Practice

    Many practical uses of regular expressions involve tracking the position in the input. This is technically cheating, since the input could be arbitrarily long. Even if you used a 64-bit pointer, the input could possibly be 2^64 + 1 symbols long, and you would fail. Your position pointers have to grow with the length of the input, and now your algorithm requires more than constant space to execute. In practice this isn't relevant, because if your regular expression did end up working its way through that much input, you probably won't notice the failure, because you'd have terminated it long before then.

    Of course, we want to do more than just accept or reject inputs as a whole. The most useful variation on this is to extract submatches, to discover which portion of an input was matched by a certain section of the original expression. The simple way to achieve this is to add an epsilon transition for each of the opening and closing parentheses in the expression. When the FSM simulator encounters one of these states, it annotates the state pointer with information about where in the input it was at the time it encountered that particular transition. If the same pointer returns to that transition a second time, the old annotation is discarded and replaced with a new annotation for the new input position. If two state pointers with disagreeing annotations collapse to the same state, the annotation from the later input position again wins.
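
    This newest-annotation-wins rule is what most mainstream engines expose for capture groups inside quantifiers: only the final iteration of the group is reported. A quick illustration with Python's re:

    import re

    # (a+b) matches twice, 'aab' then 'ab'; only the later annotation
    # survives, so the reported submatch is 'ab'.
    m = re.match(r"(a+b)*", "aabab")
    print(m.group(0))   # 'aabab'
    print(m.group(1))   # 'ab'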

    If you are sticking to Thompson NFA or DFA implementations, then there's not really any notion of greedy or non-greedy matching. A backtracking algorithm needs to be given a hint about whether it should start by trying to match as much as it can and recursively trying less, or trying as little as it can and recursively trying more, when its first attempt fails. The Thompson NFA method tries all possible quantities simultaneously. On the other hand, you might still wish to use some greedy/non-greedy hinting. This information would be used to determine whether newer or older submatch annotations should be preferred, in order to capture just the right portion of the input.

    Another kind of practical enhancement is assertions, productions which do not consume input, but match or reject based on some aspect of the input position. For instance, in perl regex, \b indicates that the input must contain a word boundary at that position, such that the symbol just matched must be a word character, but the next character must not be, or vice versa. Again, we manage this by adding an epsilon transition with special instructions to the simulator. If the assertion passes, then the state pointer continues; otherwise it is discarded.
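
    The predicate such a tagged ε-transition would evaluate is cheap and consumes nothing; here is a sketch of the \b test (my own illustrative helper, assuming None marks the ends of the input):

    def is_word_char(c):
        return c is not None and (c.isalnum() or c == '_')

    def word_boundary(prev_char, next_char):
        """True iff exactly one side of this position is a word character."""
        return is_word_char(prev_char) != is_word_char(next_char)

    # The simulator would call this with the character it just consumed and
    # the one it is about to consume; the state pointer survives only on True.
    print(word_boundary(None, 'a'))   # True: a word starts here
    print(word_boundary('a', ' '))    # True: a word ends here
    print(word_boundary('a', 'b'))    # False: inside a word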

    Lookahead and lookbehind assertions can be achieved with a bit more work. A typical lookbehind assertion r0(?<=r1)r2 is transformed into two separate expressions, .*r1 and r0εr2. Both expressions are applied to the input. Note that we added a .* to the assertion expression, because we don't actually care where it starts. When the simulator encounters the epsilon in the second generated fragment, it checks up on the state of the first fragment. If that fragment is in a state where it could accept right there, the assertion passes with the state pointer flowing into r2, but otherwise, it fails, and both fragments continue, with the second discarding the state pointer at the epsilon transition.
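
    For reference, these are the semantics the two-expression transformation has to reproduce (shown with Python's re, whose lookbehind is built in):

    import re

    # 'c' counts only where the input just read could end with 'ab'.
    print(re.findall(r"(?<=ab)c", "abc acc abcc"))   # ['c', 'c']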

    Lookahead also works by using an extra regex fragment for the assertion, but is a little more complex, because when we reach the point in the input where the assertion must succeed, none of the corresponding characters have been encountered yet (in the lookbehind case, they have all been encountered). Instead, when the simulator reaches the assertion, it starts a pointer in the start state of the assertion subexpression and annotates the state pointer in the main part of the simulation so that it knows it is dependent on the subexpression pointer. At each step, the simulation must check that the state pointer it depends upon is still matching. If it doesn't find one, then it fails wherever it happens to be. You don't have to keep any more copies of the assertion subexpression's state pointers than you do for the main part; if two state pointers in the assertion land on the same state, then the state pointers that depend on each of them will share the same fate, and can be reannotated to point to the single pointer you keep.

    While we're adding special instructions to epsilon transitions, it's not a terrible idea to suggest an instruction that makes the simulator pause once in a while to let the user see what's going on. Whenever the simulator encounters such a transition, it will wrap up its current state in some kind of package that can be returned to the caller, inspected or altered, and then resumed where it left off. This could be used to match input interactively, so if the user types only a partial match, the simulator can ask for more input, but if the user types something invalid, the simulator is empty, and can complain to the user. Another possibility is to yield every time a subexpression is matched, allowing you to peek at every submatch in the input. This couldn't be used to exclude some submatches, though. For instance, if you tried to match ((a)*b) against aaa, you could see three submatches for the a's, even though the whole expression ultimately fails because there is no b, and there is no submatch for the corresponding b's.
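
    Here is a hedged sketch of that pause-and-resume idea, again on top of the earlier NFA sketch (add_state, MATCH and nfa assumed): the caller feeds the stream one character at a time, and between characters can ask whether the input so far matches, or whether no continuation can ever match.

    class StreamMatcher:
        def __init__(self, nfa_start):
            self.current = set()
            add_state(nfa_start, self.current)

        def feed(self, c):
            nxt = set()
            for s in self.current:
                if s.char == c:
                    add_state(s.out, nxt)
            self.current = nxt

        def accepted(self):   # everything seen so far is in the language
            return MATCH in self.current

        def dead(self):       # no extension of the input can ever match
            return not self.current

    m = StreamMatcher(nfa)
    m.feed('b'); m.feed('a')
    print(m.accepted(), m.dead())   # True False: 'ba' matches, 'baa' still live
    m.feed('a')
    print(m.accepted(), m.dead())   # True False: 'baa' matches too
    m.feed('a')
    print(m.accepted(), m.dead())   # False True: nothing can match now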

    Finally, there might be a way to modify this to work with backreferences. Even if it's elegant, it's sure to be inefficient; specifically, matching regular expressions with backreferences is NP-complete, so I won't even try to think of a way to do this, because we are only interested (here) in (asymptotically) efficient possibilities.
