Pattern backreference to an optional capturing subexpression

Question

I am trying to use Bash's built-in regular expression matching to parse the following types of strings, which are to be converted to Perl substitution expressions (the quotes are not part of the data):

'~#A#B#'
#^ ^ ^-- Replacement string.
#| +---- Pattern string.
#+------ Regular expression indicator (no need to escape strings A and B),
#        which is only allowed if strings A and B are surrounded with ##.
#        Strings A and B may not contain #, but are allowed to have ~.

'#A#B#'
#^------ When regex indicator is missing, strings A and B will be escaped.

'A#B'
#        Simplified form of '#A#B#', i. e. without the enclosing ##.
#        Still none of the strings A and B is allowed to contain # at any position,
#        but can have ~, so leading ~ should be treated as part of string A.

I tried the following pattern (again, without quotes):

'^((~)?(#))?([^#]+)#([^#]+)\3$'

That is, it declares the leading ~# optional (and ~ in it even more optional), then captures parts A and B, and requires the trailing # to be present only if it was present in the leader. The leading # is captured for backreference matching only — it is not needed elsewhere, while ~ is captured to be inspected by script afterwards.

However, that pattern only works as expected with the most complete types of input data:

'~#A#B#'
'#A#B#'

but not with

'A#B'

That is, whenever the leading part is missing, \3 fails to match. But if \3 is replaced with .*, the match succeeds and ${BASH_REMATCH[3]} turns out to be an empty string. I do not understand this, given that unset variables are treated as empty strings in Bash. How do I match a backreference to optional content, then?

As a workaround, I could write an alternative pattern

'^(~?)#([^#]+)#([^#]+)#$|^([^#]+)#([^#]+)$'

but it results in distinct capture groups for each possible case, which makes the code less intuitive.
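One way to keep the rest of the script readable despite the differing group numbers is to normalize them right after the match. The following is only a sketch: the function name parse_spec and the output variables spec_regex, spec_a and spec_b are hypothetical names chosen for illustration, not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: parse a spec with the backreference-free workaround
# pattern and copy whichever groups matched into fixed variable names.
parse_spec() {
  local re='^(~?)#([^#]+)#([^#]+)#$|^([^#]+)#([^#]+)$'
  [[ $1 =~ $re ]] || return 1
  if [[ -n ${BASH_REMATCH[2]} ]]; then
    # Enclosed form: '~#A#B#' or '#A#B#' (groups 1 to 3).
    spec_regex=${BASH_REMATCH[1]}   # "~" or empty
    spec_a=${BASH_REMATCH[2]}
    spec_b=${BASH_REMATCH[3]}
  else
    # Simplified form: 'A#B' (groups 4 and 5); a leading ~ belongs to A.
    spec_regex=''
    spec_a=${BASH_REMATCH[4]}
    spec_b=${BASH_REMATCH[5]}
  fi
}

parse_spec '~#A#B#' && printf '%s|%s|%s\n' "$spec_regex" "$spec_a" "$spec_b"   # prints ~|A|B
parse_spec '~A#B'   && printf '%s|%s|%s\n' "$spec_regex" "$spec_a" "$spec_b"   # prints |~A|B
```

Because this pattern uses no backreference, it should not depend on the regex library Bash was built against.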

Important note. As @anubhava mentioned in his comment, backreference matching may not be available in some Bash builds (perhaps it is a matter of build options rather than of version number, or even of some external library). This question is of course targeted at those Bash environments that support such functionality.
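A script can probe at run time whether its Bash honors backreferences in [[ =~ ]]. This is a rough sketch; the function name supports_backrefs is hypothetical, and the second test assumes that an implementation without backreference support would treat \1 as a literal 1.

```shell
#!/usr/bin/env bash
# Hypothetical probe: returns 0 only if \1 behaves as a backreference.
supports_backrefs() {
  local re='^(a)\1$'
  # "aa" must match (the group text is repeated) and "a1" must not
  # (which it would if \1 were taken as the literal character 1).
  [[ aa =~ $re ]] && ! [[ a1 =~ $re ]]
}

supports_backrefs || echo 'this Bash does not seem to support backreferences' >&2
```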

Recommended answer

There are two ways to deal with this problem:

  1. Instead of making the group optional (in other words, allowing it to not match at all), make it mandatory but let it match the empty string. That is, change constructs like (#)? to (#?).

  2. Use a conditional to match the backreference \3 only if group 3 matched. To do this, change \3 to (?(3)#|).

Generally, the first option is preferable because of its better readability. Also, bash's regular expressions don't seem to support conditional constructs, so we need to make option 1 work. This is difficult because of the additional condition that ~ is only allowed if # is also present. If bash supported lookaheads, we could do something like ((~)(?=#))?(#?). But since it doesn't, we need to get creative. I've come up with the following pattern:

^((~(#))|(#?))([^#]+)#([^#]+)(\3|\4)$

The idea is to make use of the alternation operator | to handle two different cases: Either the text starts with ~#, or it doesn't. ((~(#))|(#?)) captures ~# in group 2 and # in group 3 if possible, but if there's no ~ then it just captures # (if present) in group 4. Then we can use (\3|\4) at the end to match the closing #, if there was an opening one (remember, group 3 captured # if the text started with ~#, and group 4 captured # or the empty string if the text did not start with ~#).
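As a sketch of how the groups of this pattern might be consumed in a script: the function name parse_spec_backref and the variables spec_is_regex, spec_a and spec_b below are hypothetical, and the code assumes a Bash whose regex library supports backreferences in [[ =~ ]].

```shell
#!/usr/bin/env bash
# Hypothetical helper built on the pattern above.
# Assumption: this Bash build supports backreferences in [[ =~ ]].
parse_spec_backref() {
  local re='^((~(#))|(#?))([^#]+)#([^#]+)(\3|\4)$'
  [[ $1 =~ $re ]] || return 1
  # Group 2 holds "~#" only when the regex indicator is present.
  if [[ -n ${BASH_REMATCH[2]} ]]; then
    spec_is_regex=1
  else
    spec_is_regex=0
  fi
  # Strings A and B land in the same groups for every input form.
  spec_a=${BASH_REMATCH[5]}
  spec_b=${BASH_REMATCH[6]}
}

for spec in '~#A#B#' '#A#B#' 'A#B'; do
  if parse_spec_backref "$spec"; then
    printf '%s -> regex=%s A=%s B=%s\n' "$spec" "$spec_is_regex" "$spec_a" "$spec_b"
  else
    printf '%s -> no match\n' "$spec"
  fi
done
```

Unlike the two-branch workaround from the question, A and B always end up in groups 5 and 6, so no per-case group selection is needed.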
