Why are regex capturing groups indexed at one?

Question

Part of me worries that this question will get closed, but I'm genuinely baffled by something. In the regex flavor of every language I've used, capturing groups are indexed from one, even when the rest of the language is indexed from zero. I can think of design decisions that would lead to 1-indexing, usually to lower the barrier to entry for non-technical people, but regex is already hellish and incomprehensible, so that argument doesn't really seem to hold.

Additionally, since each language seems to have its own small tweaks on regex, it would seem sensible for capturing group indexing to be consistent with the rest of the language.

Is there some other explanation? The idea has popped into my head that the 1-indexing is the result of something deeper within the belly of regex (like something inherently taking up the zero spot) or something along those lines. That said, I wasn't able to find any documentation on this particular quirk. Are there any regex masters out there who are aware of something deeper going on here, or is it just a holdover from seriously legacy code?

Solution

"In every language's regex that I've used, the capturing groups are indexed at one, even when the rest of the language is indexed at zero."

I guess that by "the rest of the language" you mean arrays and other container types. Well, in regex, capture groups do start at 0, but that is not obvious at first.

Capture group 0 contains the complete match, and the capture groups after it are the ones you create explicitly with parentheses - ().

So, for the regex below, applied to the string "ab123cd":

ab(\d+)cd

There are really two groups:

  • Group 0 - The complete match - ab123cd
  • Group 1 - The group you captured using () - 123
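
As a minimal sketch of the two groups listed above, the following Java snippet applies the same regex to the same string and reads groups 0 and 1. The class name GroupZeroDemo and the helper firstMatchGroups are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupZeroDemo {
    // Hypothetical helper: returns {group 0, group 1} for the first match,
    // or null if nothing matches. Assumes the regex has at least one
    // parenthesized group, otherwise group(1) would throw.
    static String[] firstMatchGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("ab(\\d+)cd", "ab123cd");
        System.out.println("group 0 = " + groups[0]); // group 0 = ab123cd
        System.out.println("group 1 = " + groups[1]); // group 1 = 123
    }
}
```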

From there on, the groups are numbered in the order in which their opening parentheses occur.

So, for the regex below (whitespace added for readability):

ab(    x   (\d+))cd
  ^        ^
  |        |
 group 1  group 2

When you apply the above regex (without the added whitespace) to the string "abx123cd", you get the following groups:

  • Group 0 - The complete match - abx123cd
  • Group 1 - The pattern starting at the first opening parenthesis - x123
  • Group 2 - The pattern starting at the second opening parenthesis - 123
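
The numbering of nested groups can be checked with a short loop. This is only a sketch: NestedGroupsDemo and allGroups are hypothetical names, and the loop relies on Matcher.groupCount(), which reports the number of parenthesized groups and does not count group 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupsDemo {
    // Hypothetical helper: collects group 0 through group groupCount()
    // for the first match; returns an empty list if nothing matches.
    static List<String> allGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> groups = new ArrayList<>();
        if (m.find()) {
            // groupCount() excludes group 0, hence the <= bound.
            for (int i = 0; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Same pattern as above, written without the extra whitespace.
        System.out.println(allGroups("ab(x(\\d+))cd", "abx123cd"));
        // prints [abx123cd, x123, 123]
    }
}
```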

When you use these regexes in Java, you can get all of these groups with the following methods:

  • Matcher.group() to get group 0 (note that it takes no parameters), and
  • Matcher.group(int) to get the rest of the groups (note the int parameter, which takes the number of the respective group)
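
A small sketch of how the two methods relate, assuming the same pattern as in the first example: group() returns the same text as group(0), and asking for a group number above groupCount() throws IndexOutOfBoundsException. The class and method names here are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherGroupDemo {
    // Hypothetical helper: builds a Matcher that has matched the whole input.
    static Matcher fullMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("input does not match: " + input);
        }
        return m;
    }

    // True if the no-argument group() returns the same text as group(0).
    static boolean noArgEqualsGroupZero(String regex, String input) {
        Matcher m = fullMatch(regex, input);
        return m.group().equals(m.group(0));
    }

    // True if requesting a group number past groupCount() throws.
    static boolean outOfRangeThrows(String regex, String input) {
        Matcher m = fullMatch(regex, input);
        try {
            m.group(m.groupCount() + 1);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(noArgEqualsGroupZero("ab(\\d+)cd", "ab123cd")); // true
        System.out.println(outOfRangeThrows("ab(\\d+)cd", "ab123cd"));     // true
    }
}
```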
