Perl regex matching optional phrase in longer sentence
Problem description
I'm trying to match an optional (possibly present) phrase in a sentence:
perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3\n" if m/(word1).*(word2)?.*(word3)/'
Output:
1:word1 2: 3:word3
I know the first '.*' is being greedy and matching everything up to 'word3'. Making it non-greedy doesn't help:
perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3\n" if m/(word1).*?(word2)?.*(word3)/'
Output:
1:word1 2: 3:word3
There seems to be a conflict of interest here. I would have thought Perl would match (word2)? if possible and still satisfy the non-greedy .*?. At least that's my understanding of '?'. The Perl regex page says '?' matches one or zero times, so shouldn't it prefer one match rather than zero?
Even more confusing is if I capture the .*?:
perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3 4:$4\n" if m/(word1)(.*?)(word2)?.*(word3)/'
Output:
1:word1 2: 3: 4:word3
All groups here are capturing groups so I don't know why they are empty.
Just to make sure the inter-word space isn't being captured:
perl -e '$_="word1_word2_word3"; print "1:$1 2:$2 3:$3 4:$4\n" if m/(word1)(.*?)(word2)?.*(word3)/'
Output:
1:word1 2: 3: 4:word3
Given the only match not capturing is the one between word2 and word3 I can only assume that it's the one doing the matching. Sure enough:
perl -e '$_="word1_word2_word3"; print "1:$1 2:$2 3:$3 4:$4 5:$5\n" if m/(word1)(.*?)(word2)?(.*)(word3)/'
Output:
1:word1 2: 3: 4:_word2_ 5:word3
So the greedy matching is working backwards, and Perl is happy to match zero (rather than one) instance of word2. Making it non-greedy doesn't help either.
So my question is: how can I write my regex to match and capture a possible phrase in a sentence? My examples given here are simplistic; the actual sentence I am parsing is much longer with many words between those I am matching, so I can't assume any length or composition of intervening text.
Many thanks, Scott
Recommended answer
BACKGROUND: HOW LAZY AND GREEDY QUANTIFIERS WORK
You need to understand how greedy and lazy quantifiers work. Greedy ones will grab the text their patterns can match at once, and then the engine will backtrack, i.e. it will try to go back to the place where the greedily quantified subpattern matched the substring, trying to check if the next subpattern can be matched.
Lazy quantifiers match the minimum number of characters first and then try to match the rest of the subpatterns. With *?, the subpattern first matches zero characters (an empty string) and goes on to check whether the next subpattern can be matched; only if it cannot is the lazy subpattern "expanded" to include one more character, and so on.
So, (word1).*(word2)?.*(word3) will match word2 with the first .* (and the second .* will match an empty string), because the first .* is greedy. Although you might think that (word2)? is greedy and thus must be backtracked to, the answer is no: the first .* grabbed the whole string, and then the engine went backwards looking for a match. Since (word2)? can match an empty string, it always succeeds, and word3 was matched first, from the end of the string. See this demo and check the regex debugger section.
You thought, let's use lazy matching with the first .*?. The problem with (word1).*?(word2)?.*(word3) (which matches word2 with the second, greedy .*) is a bit different, as it could match the optional group. How? The first .*? matches zero characters, then the engine tries to match all subsequent subpatterns. Thus, it found word1, then an empty string, and did not find word2 right after word1, so (word2)? matched an empty string and the greedy .* consumed the rest. If word2 were right after word1, it would be captured by (word2)? while the first .*? stayed empty. See this demo.
There are two solutions that I see at this moment, and they both consist in making the second optional group "exclusive" for the rest of the pattern, so that the regex engine could not skip it if found.
- A branch reset solution provided by Casimir above. Its disadvantage is that it cannot be ported to the many regex flavors that do not support branch reset. See the description in the original answer.
- Use a tempered greedy token: (word1)(?:(?!word2).)*(word2)?.*?(word3). It is less efficient than the branch reset solution, but can be ported to JS, Python, and most other regex flavors supporting lookaheads. How does that work? (?:(?!word2).)* matches 0+ occurrences of any character other than a newline (with /s, even including a newline) that does not start the literal character sequence word2. If w is matched, it cannot be followed by ord2 for the construct to match. Thus, when it reaches word2, it stops and lets the subsequent subpattern, (word2)?, match and capture the following word2. To make this approach more efficient, use the unroll-the-loop technique: (word1)[^w]*(?:w(?!ord2)[^w]*)*(word2)?.*?(word3).