Regex Alternation Order


Question

I set up a complex regex to extract data from a page of text. For some reason the order of the alternation is not what I expect. A simple example would be:

((13th|(Executive |Residential)|((\w+) ){1,3})Floor)

Put simply, I am trying to get either a floor number or a known named floor and, as a backup, to capture 1-3 unknown words followed by Floor, just in case, to review later. (I in fact use a group name to identify this, but didn't want to confuse the issue.)

The problem is that if the string is

on the 13th Floor

I don't get 13th Floor; I get on the 13th Floor, which seems to indicate that it is matching the third alternative. I'd have expected it to match 13th Floor. I set this up specifically (or so I thought) to prioritize the types of matches and to leave the vague ones for last, used only if the others are missed. I guess they weren't kidding when they said regex is greedy, but I am unclear how to set this up to be 'greedy' and still behave the way I want.

Recommended answer

An automaton is worth a thousand words: [automaton diagram of the regex, not preserved]

Your problem is that you're using a greedy \w+ sub-regex in your alternation. Because, as @rigderunner states in his comment, an NFA matches the longest leftmost substring, the \w+ will always match whatever comes before Floor, whether that is a series of arbitrary words, or 13th, Executive or Residential, or all three of them. The parentheses do not change how the alternation behaves.

So the worst-case input that it matches, and that you don't want it to match, is:

xxxx yyyy zzz tttt Floor
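
The behaviour described so far can be reproduced with a short script. This is only a sketch, assuming Python's re module (a backtracking engine) behaves like the engine used in the question; the helper name first_match is made up for illustration. The search tries each starting position from left to right, and the order of the alternatives only decides between branches that start at the same position, so the vague third branch wins as soon as it succeeds at an earlier position.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"((13th|(Executive |Residential)|((\w+) ){1,3})Floor)")


def first_match(text):
    """Return the text of the first match in `text`, or None (hypothetical helper)."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


# The search starts at index 0, where only the \w+ branch can succeed.
print(first_match("on the 13th Floor"))         # on the 13th Floor
# Worst case: any three words before "Floor" are swallowed.
print(first_match("xxxx yyyy zzz tttt Floor"))  # yyyy zzz tttt Floor
```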

The problem with your regex is that you expect it to do something that actual regular expressions can't do: you want it to match arbitrary words only if the other alternatives did not work out. Because a regular language can't keep track of state, a strictly regular regex can't express this.

I'm actually not sure whether some kind of lookahead could help you do this in one regex, and even if it can, you'll end up with a very complicated, unreadable and maybe even inefficient regex.

So you may prefer to use two regexes instead, and take the groups from the second one only if the first one fails:

((13th|Executive|Residential) +Floor)

and, if there is no match:

((\w+ +){1,3}Floor)
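
A minimal sketch of this two-step approach in Python follows. The function name find_floor and the 'known'/'fallback' labels are assumptions made for illustration, not part of the original answer, and the quantifier is written as {1,3}, the form used in the question.

```python
import re

# First pattern: the known floor names from the answer.
KNOWN = re.compile(r"((13th|Executive|Residential) +Floor)")
# Second pattern: one to three arbitrary words before "Floor".
FALLBACK = re.compile(r"((\w+ +){1,3}Floor)")


def find_floor(text):
    """Return (kind, matched_text); kind is 'known', 'fallback', or None."""
    m = KNOWN.search(text)
    if m:
        return "known", m.group(1)
    # Only consult the vague pattern when the specific one found nothing.
    m = FALLBACK.search(text)
    if m:
        return "fallback", m.group(1)
    return None, None


print(find_floor("on the 13th Floor"))    # ('known', '13th Floor')
print(find_floor("on the Garden Floor"))  # ('fallback', 'on the Garden Floor')
```

Keeping the two patterns separate puts the priority in ordinary control flow, so the vague pattern can never win over a known floor name.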

N.B.: to avoid repeating myself, please have a look at that other answer, where I list some interesting resources on regex and NFAs. They will help you understand how regex actually works.
