Perl non-greedy problem

Question

I am having a problem with a non-greedy regular expression. I've seen other questions about non-greedy regexes, but they don't answer my problem.

Problem: I am trying to match the href of the "lol" anchor.

Note: I know this can be done with perl HTML parsing modules, and my question is not about parsing HTML in perl. My question is about the regular expression itself and the HTML is just an example.

Test case: I have 4 tests each for .*? and [^"]. The first two produce the expected result. However, the 3rd doesn't, and the 4th does, but I don't understand why.

Questions:

  1. Why does the 3rd test fail for both .*? and [^"]? Shouldn't the non-greedy operator work?
  2. Why does the 4th test work for both .*? and [^"]? I don't understand why including a .* in front changes the regex. (The 3rd and 4th tests are the same except for the .* in front.)

I probably don't understand exactly how these regexes work. A Perl Cookbook recipe mentions something, but I don't think it answers my question.

use strict;

my $content=<<EOF;
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)"~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)".*>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nWhy does not the 2nd non-greedy '?' work?\n"
  if $content =~ m~href="(.*?)".*?>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nIt now works if I put the '.*' in the front?\n"
  if $content =~ m~.*href="(.*?)".*?>lol~s ;

print "\n###################################################\n";
print "Let's try now with [^]";
print "\n###################################################\n\n";


print "| $1 | \n\nThat's ok\n" if $content =~ m~href="([^"]+?)"~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nThat's ok.\n" if $content =~ m~href="([^"]+?)".*>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nThe 2nd non-greedy still doesn't work?\n"
  if $content =~ m~href="([^"]+?)".*?>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nNow with the '.*' in front it does.\n"
  if $content =~ m~.*href="([^"]+?)".*?>lol~s ;

Recommended answer

Try printing out $& (the text matched by the entire regex) as well as $1. This may give you a better idea of what's happening.
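
As a minimal sketch of that suggestion, the snippet below prints $& next to $1 for the third pattern from the question. The two-line input is a shortened stand-in for the question's HTML, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Shortened stand-in for the question's HTML (illustrative input).
my $content = <<'EOF';
<a href="/bar/bar" class="bar">bar</a>
<a href="/lol/lol" class="lol">lol</a>
EOF

my ($captured, $whole);

# Third pattern from the question: lazy capture, then lazy filler up to ">lol".
if ($content =~ m~href="(.*?)".*?>lol~s) {
    ($captured, $whole) = ($1, $&);
    print "\$1 = [$captured]\n";
    print "\$& = [$whole]\n";
}
```

Here the whole match begins at the first href= and runs across the line break to the first >lol, which is why $1 holds the first link's path rather than the "lol" one.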

The problem you seem to have is that .*? does not mean "Find the match out of all possible matches that uses the fewest characters here." It just means "First, try matching 0 characters here, and go on to match the rest of the regex. If that fails, try matching 1 character. If the rest of the regex won't match, try 2 characters here. etc."
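
The "try 0 characters, then 1, then 2" behaviour can be sketched on a tiny string. The input 'a1b22b' and the variable names are made up for illustration; the point is that a lazy quantifier still grows as far as the rest of the pattern forces it to.

```perl
use strict;
use warnings;

my $text = 'a1b22b';   # illustrative input

# Lazy: grows one character at a time and stops at the first 'b' that works.
my ($lazy) = $text =~ /a(.*?)b/;

# Greedy: grabs everything first, then backs off to the last 'b'.
my ($greedy) = $text =~ /a(.*)b/;

# Lazy again, but the 'b' must now sit at the end of the string,
# so the lazy part has to keep growing until that is true.
my ($lazy_anchored) = $text =~ /a(.*?)b\z/;

print "lazy:          [$lazy]\n";
print "greedy:        [$greedy]\n";
print "lazy anchored: [$lazy_anchored]\n";
```

In the third case the lazy and greedy captures coincide: "lazy" only sets the order in which lengths are tried, not a guarantee that the result is short.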

Perl will always find the match that starts closest to the beginning of the string. Since most of your patterns start with href=, it will find the first href= in the string and see if there's any way to expand the repetitions to get a match beginning there. If it can't get a match, it'll try starting at the next href=, and so on.
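
A rough way to see the "leftmost start wins" rule is to record where each match begins, using $-[0] (the start offset of the last successful match). The second pattern below is not from the question or the answer; it is an assumed variant whose filler, [^>]*, cannot cross a '>', so no match is possible from the first href= and the engine has to move on to the next one.

```perl
use strict;
use warnings;

# Shortened stand-in for the question's HTML (illustrative input).
my $content = <<'EOF';
<a href="/bar/bar" class="bar">bar</a>
<a href="/lol/lol" class="lol">lol</a>
EOF

my $first_href  = index($content, 'href=');
my $second_href = index($content, 'href=', $first_href + 1);

# Lazy filler: a match exists starting at the first href=, so that start wins.
my ($lazy_path, $lazy_start);
if ($content =~ m~href="(.*?)".*?>lol~s) {
    ($lazy_path, $lazy_start) = ($1, $-[0]);
}

# Restricted filler (assumed variant): nothing can match from the first href=,
# so the engine retries from the second one.
my ($strict_path, $strict_start);
if ($content =~ m~href="([^"]+)"[^>]*>lol~) {
    ($strict_path, $strict_start) = ($1, $-[0]);
}

printf "lazy:       %s (match starts at offset %d)\n", $lazy_path, $lazy_start;
printf "restricted: %s (match starts at offset %d)\n", $strict_path, $strict_start;
```

The restricted variant only works because the href and the link text sit in the same tag pair in this sample; it is a demonstration of the matching order, not a general HTML-parsing recipe.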

When you add a greedy .* to the beginning of the regex, matching starts with the .* grabbing as many characters as it can. Perl then backtracks to find an href=. Essentially, this causes it to try the last href= in the string first, and work towards the beginning of the string.
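
The effect of the leading greedy .* can be sketched the same way. The three-line input is again a shortened stand-in for the question's HTML, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Shortened stand-in for the question's HTML (illustrative input).
my $content = <<'EOF';
<a href="/bar/bar" class="bar">bar</a>
<a href="/lol/lol" class="lol">lol</a>
<a href="/koo/koo" class="koo">koo</a>
EOF

my ($path, $start);

# Fourth pattern from the question: the leading .* swallows the whole string,
# then gives characters back until href=" can match, trying the last href= first.
if ($content =~ m~.*href="(.*?)".*?>lol~s) {
    ($path, $start) = ($1, $-[0]);   # $-[0] is the offset where the match began
}

print "captured: $path\n";
print "match starts at offset $start\n";
```

The overall match still starts at the beginning of the string; only the position of the href= inside it changes. Note that this picks the last href= that is followed somewhere by >lol, so it would capture a different link if another >lol appeared later in the input.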
