Find out the position where a regular expression failed

Problem description

I'm trying to write a lexer in JavaScript for finding tokens of a simple domain-specific language. I started with a simple implementation which just tries to match subsequent regexps from the current position in a line to find out whether it matches some token format and accept it then.

The problem is that when something doesn't match inside such regexp, the whole regexp fails, so I don't know which character exactly caused it to fail.

Is there any way to find out the position in the string which caused the regular expression to fail?

INB4: I'm not asking about debugging my regexp and verifying its correctness. It is correct already: it matches correct strings and rejects incorrect ones. I just want to know programmatically where exactly the regexp stopped matching, to find out the position of the character that was incorrect in the user input, and how many characters before it were OK.

Is there some way to do it with just simple regexps instead of going on with implementing a full-blown finite state automaton?

Recommended answer

Short answer

There is no such thing as a "position in the string that causes the regular expression to fail".

However, I will show you an approach to answer the reverse question:

At which token in the regex did the engine become unable to match the string?

Discussion

In my view, the question of the position in the string which caused the regular expression to fail is upside-down. As the engine moves down the string with the left hand and the pattern with the right hand, a regex token that matches six characters one moment can later, because of quantifiers and backtracking, be reduced to matching zero characters the next—or expanded to match ten.

In my view, a more proper question would be:

At which token in the regex did the engine become unable to match the string?

For instance, consider the regex ^\w+\d+$ and the string abc132z.

The \w+ can actually match the entire string. Yet, the entire regex fails. Does it make sense to say that the regex fails at the end of the string? I don't think so. Consider this.

Initially, \w+ will match abc132z. Then the engine advances to the next token: \d+. At this stage, the engine backtracks in the string, gradually letting the \w+ give up the 2z (so that the \w+ now only corresponds to abc13), allowing the \d+ to match 2.

At this stage, the $ assertion fails as the z is left. The engine backtracks, letting the \w+ give up the 3 character, then the 1 (so that the \w+ now only corresponds to abc), eventually allowing the \d+ to match 132. At each step, the engine tries the $ assertion and fails. Depending on engine internals, more backtracking may occur: the \d+ will give up the 2 and the 3 once again, then the \w+ will give up the c and the b. When the engine finally gives up, the \w+ only matches the initial a. Can you say that the regex fails on the "3"? On the "b"?

No. If you're looking at the regex pattern from left to right, you can argue that it fails on the $, because it's the first token we were not able to add to the match. Bear in mind that there are other ways to argue this.
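
To make this concrete, here is a minimal JavaScript sketch using the regex and string from the example above. The variable names are illustrative only. It suggests why the question is hard to answer from the outside: the shorter patterns succeed on their own, while the full pattern simply returns null, with no failure position attached.

```javascript
// Illustrative only: the pattern and input come from the example above.
const input = "abc132z";

// \w+ on its own consumes the whole string.
const wordOnly = /^\w+/.exec(input);

// \w+\d+ also succeeds, after \w+ backtracks to leave a digit for \d+.
const wordThenDigits = /^\w+\d+/.exec(input);

// The full pattern fails: exec returns null and nothing else.
const full = /^\w+\d+$/.exec(input);

console.log(wordOnly[0]);       // "abc132z"
console.log(wordThenDigits[0]); // "abc132"
console.log(full);              // null
```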

Further down, I'll give you a screenshot to visualize this. But first, let's see if we can answer the other question.

The other question

Are there techniques that allow us to answer the other question:

At which token in the regex did the engine become unable to match the string?

It depends on your regex. If you are able to slice your regex into clean components, then you can devise an expression with a series of optional lookaheads inside capture groups, allowing the match to always succeed. The first unset capture group is the one that caused the failure.

JavaScript is a bit stingy on optional lookaheads, but you can write something like this:

^(?:(?=(\w+)))?(?:(?=(\w+\d+)))?(?:(?=(\w+\d+$)))?.

In PCRE, .NET, Python... you could write this more compactly:

^(?=(\w+))?(?=(\w+\d+))?(?=(\w+\d+$))?.

What happens here? Each lookahead builds incrementally on the last one, adding one token at a time. Therefore we can test each token separately. The dot at the end is an optional flourish for visual feedback: we can see in a debugger that at least one character is matched, but we don't care about that character, we only care about the capture groups.

  1. Group 1 tests the \w+ token
  2. Group 2 seems to test \w+\d+, therefore, incrementally, it tests the \d+ token
  3. Group 3 seems to test \w+\d+$, therefore, incrementally, it tests the $ token

There are three capture groups. If all three are set, the match is a full success. If only Group 3 is not set (as with abc123a), you can say that the $ caused the failure. If Group 1 is set but not Group 2 (as with abc), you can say that the \d+ caused the failure.
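
Since engines differ in how they treat quantified lookaheads, the same incremental idea can also be sketched with separate regexes, testing one cumulative prefix of the pattern at a time. This is a variation on the approach above, not the single-regex version itself, and it assumes the pattern has been sliced into clean components by hand. The names `firstFailingToken` and `components` are hypothetical.

```javascript
// Hypothetical setup: the pattern ^\w+\d+$ sliced into components by hand.
const components = ["\\w+", "\\d+", "$"];

// Returns the first component that cannot be added to the match, plus the
// length of the text matched by the components before it.
// Returns null when every component matches.
function firstFailingToken(parts, input) {
  let prefix = "^";
  let matchedLength = 0;
  for (let i = 0; i < parts.length; i++) {
    prefix += parts[i]; // build ^\w+, then ^\w+\d+, then ^\w+\d+$
    const m = new RegExp(prefix).exec(input);
    if (m === null) {
      return { index: i, token: parts[i], matchedLength };
    }
    matchedLength = m[0].length;
  }
  return null;
}

console.log(firstFailingToken(components, "abc123a")); // token is $, matchedLength 6
console.log(firstFailingToken(components, "abc"));     // token is \d+, matchedLength 3
console.log(firstFailingToken(components, "abc123"));  // null: full match
```

Note that `matchedLength` only reports how much the last successful prefix matched. Because of backtracking, it is a rough hint about how much of the input was acceptable rather than an exact failure position.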

For reference: an inside view of a failure path

For what it's worth, here is a view of the failure path from the RegexBuddy debugger.
