Regex: Is Lazy Worse?

Question

I have always written regexes like this:

<A HREF="([^"]*)" TARGET="_blank">([^<]*)</A>

But I just learned about lazy quantifiers, and that I can write it like this:

<A HREF="(.*?)" TARGET="_blank">(.*?)</A>

Is there any disadvantage to using this second approach? The regex is definitely more compact (even SO parses it better).

Edit: There are two best answers here, which point out two important differences between the expressions. ysth's answer points to a weakness in the non-greedy/lazy one, in which the hyperlink itself could possibly include other attributes of the A tag (definitely not good). Rob Kennedy points out a weakness in the greedy example, in that anchor texts cannot include other tags (definitely not okay, because it wouldn't grab all the anchor text either)... so the answer is that, regular expressions being what they are, lazy and non-lazy solutions that seem the same are probably not semantically equivalent.
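The two differences described in this edit can be reproduced with a small sketch. It uses Python's `re` module, and the sample inputs (`extra_attr`, `nested_tag`) are made up for illustration; they are not from the original question.

```python
import re

# The two patterns from the question, unchanged.
NEGATED = re.compile(r'<A HREF="([^"]*)" TARGET="_blank">([^<]*)</A>')
LAZY = re.compile(r'<A HREF="(.*?)" TARGET="_blank">(.*?)</A>')

# Hypothetical input 1: an extra attribute sits between HREF and TARGET.
extra_attr = '<A HREF="a.html" CLASS="x" TARGET="_blank">first</A>'

# Hypothetical input 2: the anchor text contains another tag.
nested_tag = '<A HREF="b.html" TARGET="_blank">some <B>bold</B> text</A>'

# The negated-class pattern refuses both inputs.
print(NEGATED.search(extra_attr))   # None
print(NEGATED.search(nested_tag))   # None

# The lazy pattern matches both, but the first capture swallows the
# extra attribute along with the URL.
href_capture = LAZY.search(extra_attr).group(1)
text_capture = LAZY.search(nested_tag).group(2)
print(href_capture)   # a.html" CLASS="x
print(text_capture)   # some <B>bold</B> text
```

Neither result is wrong as far as the regex engine is concerned; the two patterns simply describe different sets of strings.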

Edit: Third best answer is by Alan M about relative speed of the expressions. For the time being, I'll mark his as best answer so people give him more points :)

Recommended answer

Another thing to consider is how long the target text is, and how much of it is going to be matched by the quantified subexpression. For example, if you were trying to match the whole <BODY> element in a large HTML document, you might be tempted to use this regex:

/<BODY>.*?<\/BODY>/is

But that's going to do a whole lot of unnecessary work, matching one character at a time while effectively doing a negative lookahead before each one. You know the </BODY> tag is going to be very near the end of the document, so the smart thing to do is to use a normal greedy quantifier; let it slurp up the whole rest of the document and then backtrack the few characters necessary to match the end tag.
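A rough way to check this on your own data is to time both forms. The sketch below uses Python's `re` and `timeit`; the synthetic `document` is an assumption standing in for a large HTML page, and the timings depend on the regex engine and the input, so no particular result is claimed here.

```python
import re
import timeit

# Synthetic stand-in for a large HTML document (an assumption for this sketch).
document = "<HTML><BODY>" + "x" * 200_000 + "</BODY></HTML>"

# /is in the original corresponds to IGNORECASE | DOTALL in Python.
FLAGS = re.IGNORECASE | re.DOTALL
LAZY_BODY = re.compile(r"<BODY>.*?</BODY>", FLAGS)
GREEDY_BODY = re.compile(r"<BODY>.*</BODY>", FLAGS)

lazy_match = LAZY_BODY.search(document)
greedy_match = GREEDY_BODY.search(document)

# With exactly one closing tag, both forms select the same text.
same_span = lazy_match.span() == greedy_match.span()
print(same_span)

# Measure rather than guess; results vary by engine and input.
lazy_time = timeit.timeit(lambda: LAZY_BODY.search(document), number=20)
greedy_time = timeit.timeit(lambda: GREEDY_BODY.search(document), number=20)
print(f"lazy: {lazy_time:.4f}s  greedy: {greedy_time:.4f}s")
```

The greedy form is only a safe substitute when the closing tag occurs once; if the text contained several `</BODY>` strings, the greedy pattern would run to the last one while the lazy pattern stops at the first.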

In most cases you won't notice any speed difference between greedy and reluctant quantifiers, but it's something to keep in mind. The main reason why you should be judicious in your use of reluctant quantifiers is the one that was pointed out by the others: they may do it reluctantly, but they will match more than you want them to if that's what it takes to achieve an overall match.
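As an illustration of a reluctant quantifier matching more than intended, here is a small Python sketch that applies the question's two patterns to a made-up string (`text` is hypothetical) containing one link without a TARGET attribute followed by one with it.

```python
import re

NEGATED = re.compile(r'<A HREF="([^"]*)" TARGET="_blank">([^<]*)</A>')
LAZY = re.compile(r'<A HREF="(.*?)" TARGET="_blank">(.*?)</A>')

# Hypothetical input: the first link has no TARGET attribute.
text = '<A HREF="one.html">one</A> <A HREF="two.html" TARGET="_blank">two</A>'

# The lazy pattern starts at the first <A and keeps extending .*? until
# the rest of the pattern can match, so it runs across both links.
lazy_href = LAZY.search(text).group(1)
print(lazy_href)      # one.html">one</A> <A HREF="two.html

# The negated class cannot cross a double quote, so the attempt at the
# first link fails and the search moves on to the second link.
negated_href = NEGATED.search(text).group(1)
print(negated_href)   # two.html
```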
