Regex non-capturing group is capturing


Question

I have this regex

(?:\<a[^*]href="(http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?)>

The point of this regex is to capture the closing bracket ('>') of every anchor whose href starts with "http://" or ends with ".pdf".

The regex works, but it also captures the first part of the anchor, which I absolutely need NOT to capture.

In the following samples, all match except the second (which is fine), but only the last bracket should be captured, and that is not the case.

<a href="http://blabla">omg</a>
<a href="blabla">omg</a>
<a href="http://blabla.pdf">omg</a>
<a href="/blabla.pdf">omg</a>

For example, if we take the first match, which is:

<a href="http://blabla">

I only want to capture the last bracket (the one I surrounded with parentheses):

<a href="http://blabla"(>)

So why is the non-capturing group capturing? And how can I grab only the last bracket of the anchor?

Even if I streamline my regex to the following, it still doesn't work:

(?:\<a[^*]href="http://[^"]+"+[^>]*)(>)

Thank you,

Solution

You're conflating two distinct concepts: capturing and consuming. Regexes normally consume whatever they match; that's just how they work. A non-capturing group only skips saving its text in a numbered group; the text it matches is still part of the overall match. Additionally, most regex flavors let you use capturing groups to pluck out specific parts of the overall match. (The overall match is often referred to as the zeroth capturing group, but that's just a figure of speech.)
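To make the distinction concrete, here is a minimal sketch in Python. The choice of Python's `re` module is my assumption, since the question does not say which flavor is in use, and the helper name `describe` is hypothetical. It runs the streamlined pattern from the question (without the unnecessary backslash before `<`) and shows that the non-capturing group is still consumed into the overall match, while group 1 holds only the final `>`.

```python
import re

# Streamlined pattern from the question. The (?:...) group saves nothing,
# but everything it matches is still consumed into the overall match.
PATTERN = re.compile(r'(?:<a[^*]href="http://[^"]+"+[^>]*)(>)')


def describe(html):
    """Return the overall match, group 1, and group 1's span (or None)."""
    m = PATTERN.search(html)
    if m is None:
        return None
    return {
        "whole": m.group(0),         # everything the regex consumed
        "bracket": m.group(1),       # only the captured '>'
        "bracket_span": m.span(1),   # where that '>' sits in the input
    }


print(describe('<a href="http://blabla">omg</a>'))
print(describe('<a href="blabla">omg</a>'))
```

In other words, if only the bracket is needed, reading group 1 (or its span) rather than the overall match is usually enough, whatever the flavor.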

It sounds like you're trying to match a whole <A> tag, but only consume the final >. That's not possible in most regex flavors, JavaScript included. But if you're using Perl or PHP, you could use \K to spoof the match start position:

(?i)<a\s+[^>]*?href="http://[^"]+"[^>]*\K>

And in .NET you could use a lookbehind (which, like a lookahead, matches without consuming):

(?i)(?<=<a\s+[^>]*?href="http://[^"]+"[^>]*)>

Of the other flavors that support lookbehinds, most place restrictions on them that render them unusable for this task.
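Python's standard `re` module is one example of such a restricted flavor: its lookbehinds must be fixed-width, so a variable-length lookbehind like the .NET one fails to compile. The sketch below demonstrates this and then shows one possible workaround, which is to match the whole tag and read only the position of a capturing group around the final `>`. The helper names `compiles` and `closing_bracket_positions` are hypothetical, and the pattern uses `[^>]*?` after `\s+` (my adjustment) so that a tag with a single space before `href` still matches.

```python
import re

# Variable-length lookbehind, accepted by .NET but not by Python's re module.
VARIABLE_LOOKBEHIND = r'(?i)(?<=<a\s+[^>]*?href="http://[^"]+"[^>]*)>'


def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Workaround: consume the whole tag, but capture only the final '>'.
TAG = re.compile(r'(?i)<a\s+[^>]*?href="http://[^"]+"[^>]*(>)')


def closing_bracket_positions(html):
    """Return the index of the closing '>' of each matching anchor tag."""
    return [m.start(1) for m in TAG.finditer(html)]


html = "\n".join([
    '<a href="http://blabla">omg</a>',
    '<a href="blabla">omg</a>',
    '<a href="http://blabla.pdf">omg</a>',
    '<a href="/blabla.pdf">omg</a>',
])

print(compiles(VARIABLE_LOOKBEHIND))
print(closing_bracket_positions(html))
```

This covers only the "http://" branch, as in the answer's patterns; handling hrefs that end in ".pdf" would need the alternation from the question added back.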
