环视是否会影响正则表达式可以匹配哪些语言? [英] Does lookaround affect which languages can be matched by regular expressions?

查看:86
本文介绍了环视是否会影响正则表达式可以匹配哪些语言?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

现代正则表达式引擎中有一些功能,这些功能使您可以匹配没有该功能就无法匹配的语言.例如,以下使用反向引用的正则表达式与包含重复一个单词的所有字符串的语言相匹配:(.+)\1.该语言不是常规语言,不能与不使用反向引用的正则表达式匹配.

环顾四周还会影响正则表达式可以匹配哪些语言吗? IE.是否有可以使用环视匹配的语言,否则无法匹配?如果是这样,这是对所有环顾方式(否定的正向查找或正向的查找或向后查找)还是仅某些其中的一种是正确的?

解决方案

正如其他答案所述,环顾四周不会为正则表达式增加任何额外的功能.

我认为我们可以使用以下内容来显示这一点:

一个Pebble 2-NFA (请参见引用它的文件).

1-pebble 2NFA不会处理嵌套的超前行为,但是,我们可以使用多卵石2NFA的变体(请参见下文).

简介

2-NFA是一种不确定的有限自动机,可以在其输入上向左或向右移动.

一台卵石机可以将卵石放在输入磁带上(即用卵石标记特定的输入符号),并根据当前输入位置是否有卵石进行不同的过渡.

众所周知,One Pebble 2-NFA具有与常规DFA相同的功能.

非嵌套先行

基本思想如下:

2NFA允许我们通过在输入磁带中向前或向后移动来回溯(或前磁道").因此,对于前瞻,我们可以对前瞻正则表达式进行匹配,然后回溯我们在匹配前瞻表达式时所消耗的内容.为了确切地知道何时停止回溯,我们使用了卵石!在输入dfa进行前瞻之前,我们先放下卵石,以标记需要停止回溯的位置.

因此,在通过pebble 2NFA运行字符串的末尾,我们知道我们是否匹配了lookahead表达式,并且剩下的输入(即剩下的要消耗的)正好是匹配其余表达式的要求./p>

所以先看一下形式为u(?= v)w

我们有用于u,v和w的DFA.

从u的DFA的接受状态(是的,我们可以假设只有一个)开始,将e转换为v的起始状态,并用小卵石标记输入.

从v的接受状态开始,我们进行电子转换,以保持输入向左移动,直到找到卵石为止,然后转换为w的起始状态.

从v的拒绝状态开始,我们进行电子跃迁,直到保持找到卵石的状态一直向左移动,然后转换为u的接受状态(即我们离开的位置).

常规NFA用于显示r1的证明| r2或r *等将这些小卵石2nfas保留下来.参见 http://www.coli.uni-saarland.de/projects/milca/courses/coal/html/node41.html#regularlanguages.sec.regexptofsa 有关如何将组件机器组合在一起以为更大的机器提供更多信息的信息r *表达式等.

以上证明r *等有效的原因是,当我们输入要重复的分量nfas时,回溯可确保输入指针始终位于正确的位置.同样,如果正在使用卵石,那么正由超前组件机器之一对其进行处理.由于没有先行机器到先行机器的转换而没有完全回溯并返回卵石,因此只需要一台卵石机器.

例如,考虑([^ a] | a(?= ... b))*

和字符串abbb.

我们有一个abbb,它通过peb2nfa求a(?= ... b),在它的结尾处我们处于状态:(bbb,matched)(即,输入bbb仍然存在,并且已经匹配"a",然后是"..b").现在由于*,我们回到开头(请参阅上面链接中的构造),并为[^ a]输入dfa.匹配b,回到开头,再次输入[^ a]两次,然后接受.

处理嵌套先行

要处理嵌套先行,我们可以使用此处定义的k-pebble 2NFA受限版本: 解决方案

As the other answers claim, lookarounds don't add any extra power to regular expressions.

I think we can show this using the following:

One Pebble 2-NFA (see the Introduction section of the paper which refers to it).

The 1-pebble 2NFA does not deal with nested lookaheads, but, we can use a variant of multi-pebble 2NFAs (see section below).

Introduction

A 2-NFA is a non deterministic finite automaton which has the ability to move either left or right on it's input.

A one pebble machine is where the machine can place a pebble on the input tape (i.e. mark a specific input symbol with a pebble) and do possibly different transitions based on whether there is a pebble at the current input position or not.

It is known the One Pebble 2-NFA has the same power as a regular DFA.

Non-nested Lookaheads

The basic idea is as follows:

The 2NFA allows us to backtrack (or 'front track') by moving forward or backward in the input tape. So for a lookahead we can do the match for the lookahead regular expression and then backtrack what we have consumed, in matching the lookahead expression. In order to know exactly when to stop backtracking, we use the pebble! We drop the pebble before we enter the dfa for the lookahead to mark the spot where the backtracking needs to stop.

Thus at the end of running our string through the pebble 2NFA, we know whether we matched the lookahead expression or not and the input left (i.e. what is left to be consumed) is exactly what is required to match the remaining.

So for a lookahead of the form u(?=v)w

We have the DFAs for u, v and w.

From the accepting state (yes, we can assume there is only one) of DFA for u, we make an e-transition to the start state of v, marking the input with a pebble.

From an accepting state for v, we e-transtion to a state which keeps moving the input left, till it finds a pebble, and then transitions to start state of w.

From a rejecting state of v, we e-transition to a state which keeps moving left until it finds the pebble, and transtions to the accepting state of u (i.e where we left off).

The proof used for regular NFAs to show r1 | r2, or r* etc, carry over for these one pebble 2nfas. See http://www.coli.uni-saarland.de/projects/milca/courses/coal/html/node41.html#regularlanguages.sec.regexptofsa for more info on how the component machines are put together to give the bigger machine for the r* expression etc.

The reason why the above proofs for r* etc work is that the backtracking ensures that the input pointer is always at the right spot, when we enter the component nfas for repetition. Also, if a pebble is in use, then it is being processed by one of the lookahead component machines. Since there are no transitions from lookahead machine to lookahead machine without completely backtracking and getting back the pebble, a one pebble machine is all that is needed.

For eg consider ([^a] | a(?=...b))*

and the string abbb.

We have abbb which goes through the peb2nfa for a(?=...b), at the end of which we are at the state: (bbb, matched) (i.e in input bbb is remaining, and it has matched 'a' followed by '..b'). Now because of the *, we go back to the beginning (see the construction in the link above), and enter the dfa for [^a]. Match b, go back to beginning, enter [^a] again two times, and then accept.

Dealing with Nested Lookaheads

To handle nested lookaheads we can use a restricted version of k-pebble 2NFA as defined here: Complexity Results for Two-Way and Multi-Pebble Automata and their Logics (see Definition 4.1 and Theorem 4.2).

In general, 2 pebble automata can accept non-regular sets, but with the following restrictions, k-pebble automata can be shown to be regular (Theorem 4.2 in above paper).

If the pebbles are P_1, P_2, ..., P_K

  • P_{i+1} may not be placed unless P_i is already on the tape and P_{i} may not be picked up unless P_{i+1} is not on the tape. Basically the pebbles need to be used in a LIFO fashion.

  • Between the time P_{i+1} is placed and the time that either P_{i} is picked up or P_{i+2} is placed, the automaton can traverse only the subword located between the current location of P_{i} and the end of the input word that lies in the direction of P_{i+1}. Moreover, in this sub-word, the automaton can act only as a 1-pebble automaton with Pebble P_{i+1}. In particular it is not allowed to lift up, place or even sense the presence of another pebble.

So if v is a nested lookahead expression of depth k, then (?=v) is a nested lookahead expression of depth k+1. When we enter a lookahead machine within, we know exactly how many pebbles have to have been placed so far and so can exactly determine which pebble to place and when we exit that machine, we know which pebble to lift. All machines at depth t are entered by placing pebble t and exited (i.e. we return to processing of a depth t-1 machine) by removing pebble t. Any run of the complete machine looks like a recursive dfs call of a tree and the above two restrictions of the multi-pebble machine can be catered to.

Now when you combine expressions, for rr1, since you concat, the pebble numbers of r1 must be incremented by the depth of r. For r* and r|r1 the pebble numbering remains the same.

Thus any expression with lookaheads can be converted to an equivalent multi-pebble machine with the above restrictions in pebble placement and so is regular.

Conclusion

This basically addresses the drawback in Francis's original proof: being able to prevent the lookahead expressions from consuming anything which are required for future matches.

Since Lookbehinds are just finite string (not really regexs) we can deal with them first, and then deal with the lookaheads.

Sorry for the incomplete writeup, but a complete proof would involve drawing a lot of figures.

It looks right to me, but I will be glad to know of any mistakes (which I seem to be fond of :-)).

这篇关于环视是否会影响正则表达式可以匹配哪些语言?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆