What is a non-capturing group in regular expressions?

Question

How are non-capturing groups, i.e. (?:), used in regular expressions and what are they good for?

Answer

Let me try to explain this with an example.

Consider the following text:

http://stackoverflow.com/
https://stackoverflow.com/questions/tagged/regex

Now, if I apply the regex below over it...

(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?

... I would get the following result:

Match "http://stackoverflow.com/"
     Group 1: "http"
     Group 2: "stackoverflow.com"
     Group 3: "/"

Match "https://stackoverflow.com/questions/tagged/regex"
     Group 1: "https"
     Group 2: "stackoverflow.com"
     Group 3: "/questions/tagged/regex"

But I don't care about the protocol -- I just want the host and path of the URL. So, I change the regex to include the non-capturing group (?:).

(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?

Now, my result looks like this:

Match "http://stackoverflow.com/"
     Group 1: "stackoverflow.com"
     Group 2: "/"

Match "https://stackoverflow.com/questions/tagged/regex"
     Group 1: "stackoverflow.com"
     Group 2: "/questions/tagged/regex"

See? The first group has not been captured. The parser uses it to match the text, but ignores it later, in the final result.
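
To see the difference in running code, here is a minimal sketch using Python's re module (my choice of engine for illustration; the answer itself is engine-agnostic). The variable names are made up for this example.

```python
import re

# The two sample URLs from above, one per line.
text = (
    "http://stackoverflow.com/\n"
    "https://stackoverflow.com/questions/tagged/regex"
)

# Same pattern twice: the first captures the protocol, the second does not.
capturing = re.compile(r"(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?")
non_capturing = re.compile(r"(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?")

# m.groups() returns only the captured groups, in order.
capturing_groups = [m.groups() for m in capturing.finditer(text)]
non_capturing_groups = [m.groups() for m in non_capturing.finditer(text)]

print(capturing_groups)
print(non_capturing_groups)
```

Both patterns match exactly the same text; only the numbering of the captured groups changes, so host and path move from groups 2 and 3 to groups 1 and 2.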


EDIT:

As requested, let me try to explain groups too.

Well, groups serve many purposes. They can help you extract exact information from a bigger match (and they can also be named), they let you rematch a previously matched group, and they can be used for substitutions. Let's try some examples, shall we?

Imagine you have some kind of XML or HTML (be aware that regex may not be the best tool for the job, but it is nice as an example). You want to parse the tags, so you could do something like this (I have added spaces to make it easier to understand):

   \<(?<TAG>.+?)\> [^<]*? \</\k<TAG>\>
or
   \<(.+?)\> [^<]*? \</\1\>

The first regex has a named group (TAG), while the second one uses a plain numbered group. Both regexes do the same thing: they use the value from the first group (the name of the tag) to match the closing tag. The difference is that the first one refers back to the value by name, and the second one refers back to it by group index (which starts at 1).
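
As a rough sketch of the same idea in Python: the re module spells named groups as (?P<TAG>...) and named backreferences as (?P=TAG), not (?<TAG>...) and \k<TAG> as written above, so the patterns below are adapted to that syntax. I have also dropped the explanatory spaces and the backslashes before < and >. The sample string is invented for illustration.

```python
import re

# One well-formed element and one whose closing tag does not match.
html = "<b>bold</b> <i>oops</b>"

# Named group plus named backreference (Python spelling).
named = re.compile(r"<(?P<TAG>.+?)>[^<]*?</(?P=TAG)>")
# Numbered group plus numeric backreference.
numbered = re.compile(r"<(.+?)>[^<]*?</\1>")

named_tags = [m.group("TAG") for m in named.finditer(html)]
numbered_tags = [m.group(1) for m in numbered.finditer(html)]

print(named_tags)
print(numbered_tags)
```

Only the element whose closing tag repeats the opening tag name is matched, and both patterns agree on the result.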

Let's try some substitutions now. Consider the following text:

Lorem ipsum dolor sit amet consectetuer feugiat fames malesuada pretium egestas.

Now, let's use this dumb regex over it:

\b(\S)(\S)(\S)(\S*)\b

This regex matches words with at least 3 characters, and uses groups to separate the first three letters. The result is this:

Match "Lorem"
     Group 1: "L"
     Group 2: "o"
     Group 3: "r"
     Group 4: "em"
Match "ipsum"
     Group 1: "i"
     Group 2: "p"
     Group 3: "s"
     Group 4: "um"
...

Match "consectetuer"
     Group 1: "c"
     Group 2: "o"
     Group 3: "n"
     Group 4: "sectetuer"
...

So, if we apply the substitution string:

$1_$3$2_$4

... over it, we are trying to use the first group, add an underscore, use the third group, then the second group, add another underscore, and then the fourth group. The resulting string would be like the one below.

L_ro_em i_sp_um d_lo_or s_ti_ a_em_t c_no_sectetuer f_ue_giat f_ma_es m_la_esuada p_er_tium e_eg_stas.
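
The substitution can be reproduced with a short Python sketch. Note that Python's re.sub writes group references in the replacement as \1, \2 and so on, not $1, $2; that syntax difference is specific to the engine I picked here, not something stated above.

```python
import re

text = (
    "Lorem ipsum dolor sit amet consectetuer feugiat fames "
    "malesuada pretium egestas."
)

# Same pattern as above: three single-character groups, then the rest of the word.
pattern = re.compile(r"\b(\S)(\S)(\S)(\S*)\b")

# Equivalent of the $1_$3$2_$4 substitution string in Python's syntax.
result = pattern.sub(r"\1_\3\2_\4", text)

print(result)
```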

You can use named groups for substitutions too, using ${name} in engines that support that syntax (the exact spelling varies between regex flavors).
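
For instance, Python's re module does not accept ${name} in replacement strings; it uses \g<name>. A minimal sketch, with group names (first, second, third, rest) invented for this example:

```python
import re

# The same word-splitting pattern, but with hypothetical group names.
pattern = re.compile(
    r"\b(?P<first>\S)(?P<second>\S)(?P<third>\S)(?P<rest>\S*)\b"
)

# \g<name> is Python's spelling of a named group reference in a replacement.
result = pattern.sub(r"\g<first>_\g<third>\g<second>_\g<rest>", "Lorem ipsum")

print(result)
```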

To play around with regexes, I recommend http://regex101.com/, which offers a good amount of detail on how a regex works; it also offers a few regex engines to choose from.
