Capturing Groups From a Grep RegEx

Question

I've got this little script in sh (Mac OSX 10.6) to look through an array of files. Google has stopped being helpful at this point:

files="*.jpg"
for f in $files
    do
        echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
        name=$?
        echo $name
    done

So far (obviously, to you shell gurus) $name merely holds 0, 1 or 2, depending on whether grep found that the filename matched the pattern provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that in a variable.

I'd like to use grep only, if possible. If not, please no Python or Perl, etc. sed or something like it is fine: I'm new to shell and would like to attack this from the *nix purist angle.

Also, as a super-cool bonus, I'm curious how I can concatenate strings in shell. If the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'?

Please explain what's going on, if you've got the time.

Recommended answer

If you're using Bash, you don't even have to use grep:

files="*.jpg"
regex="[0-9]+_([a-z]+)_[0-9a-z]*"
for f in $files    # unquoted in order to allow the glob to expand
do
    if [[ $f =~ $regex ]]
    then
        name="${BASH_REMATCH[1]}"
        echo "${name}.jpg"    # concatenate strings
        name="${name}.jpg"    # same thing stored in a variable
    else
        echo "$f doesn't match" >&2 # this could get noisy if there are a lot of non-matching files
    fi
done

It's better to put the regex in a variable. Some patterns won't work if included literally.
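One case where the variable matters can be sketched as follows, assuming Bash 3.2 or later: a quoted right-hand side of =~ is matched as a literal string, so a pattern containing a space cannot simply be quoted inline. The sample string and the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
s='123 abc'                   # hypothetical input containing a space
regex='^[0-9]+ ([a-z]+)$'     # the space is part of the pattern

# Unquoted expansion: the variable's contents are used as a regex.
if [[ $s =~ $regex ]]; then unquoted="match"; else unquoted="no match"; fi

# Quoted expansion: the contents are compared as a literal string.
if [[ $s =~ "$regex" ]]; then quoted="match"; else quoted="no match"; fi

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Keeping the pattern in a single-quoted variable and expanding it unquoted avoids having to backslash-escape spaces and other shell-special characters inside [[ ]].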

This uses =~, which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.
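To make the indexing concrete, here is a small sketch that adds two extra capture groups to the pattern. The file name and the extra groups are assumptions for illustration only; the original pattern has a single group.

```shell
#!/usr/bin/env bash
f='123_abc_d4e5.jpg'                     # hypothetical file name
regex='([0-9]+)_([a-z]+)_([0-9a-z]*)'    # three groups instead of one

if [[ $f =~ $regex ]]; then
    full=${BASH_REMATCH[0]}      # the whole matched text
    first=${BASH_REMATCH[1]}     # ([0-9]+)
    second=${BASH_REMATCH[2]}    # ([a-z]+)
    third=${BASH_REMATCH[3]}     # ([0-9a-z]*)
    echo "full=$full first=$first second=$second third=$third"
fi
```

Note that the full match stops before ".jpg", because the dot is not in the last bracket expression.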

You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:

123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz

To eliminate the second and fourth examples, make your regex like this:

^[0-9]+_([a-z]+)_[0-9a-z]*

which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:

^[0-9]+_([a-z]+)_[0-9a-z]*$

then the third example will also be eliminated, since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
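The effect of the anchors can be checked against the four sample strings with a short Bash sketch. The helper function matching and the variable names are hypothetical, introduced only for this demonstration.

```shell
#!/usr/bin/env bash
examples=(123_abc_d4e5 xyz123_abc_d4e5 123_abc_d4e5.xyz xyz123_abc_d4e5.xyz)

# Hypothetical helper: print the examples that match the regex in $1.
matching() {
    local re=$1 s out=""
    for s in "${examples[@]}"; do
        if [[ $s =~ $re ]]; then out+="$s "; fi
    done
    echo "${out% }"    # drop the trailing space
}

unanchored=$(matching '[0-9]+_([a-z]+)_[0-9a-z]*')     # all four match
start_only=$(matching '^[0-9]+_([a-z]+)_[0-9a-z]*')    # first and third
both_ends=$(matching '^[0-9]+_([a-z]+)_[0-9a-z]*$')    # first only

echo "unanchored: $unanchored"
echo "start only: $start_only"
echo "both ends:  $both_ends"
```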

If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):

name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg

The \K operator (variable-length look-behind) causes the preceding pattern to match, but doesn't include the match in the result. The fixed-length equivalent is (?<=), with the pattern placed before the closing parenthesis. You must use \K if quantifiers may match strings of different lengths (e.g. +, *, {2,4}).
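As a rough comparison of the two forms, the sketch below runs both against one made-up file name. It assumes a GNU grep built with -P (PCRE) support, which the stock grep on Mac OS X may not provide. The look-behind variant can only check the single underscore, since a pattern such as [0-9]+ has no fixed length.

```shell
f='123_abc_d4e5.jpg'   # hypothetical file name

# \K: match "123_" first, then drop it from the reported result.
with_k=$(echo "$f" | grep -Po '[0-9]+_\K[a-z]+(?=_[0-9a-z]*)')

# (?<=_): fixed-length look-behind that only verifies the preceding "_".
with_lookbehind=$(echo "$f" | grep -Po '(?<=_)[a-z]+(?=_[0-9a-z]*)')

echo "with \\K:          $with_k"
echo "with look-behind: $with_lookbehind"
```

Both print abc for this input, but only the \K version also requires the leading digits to be present.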

The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.

In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it, so its position is significant.
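The position dependence can be seen by moving (?i) within the pattern. This is a sketch under the same assumption of a GNU grep with -P support; the upper-case file name is invented for the demonstration.

```shell
f='123_ABC_d4e5.jpg'   # hypothetical file name with upper-case letters

# (?i) at the start: [a-z]+ is matched case-insensitively.
leading=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)' || true)

# (?i) after [a-z]+: the letters are still matched case-sensitively,
# so nothing is found and the variable stays empty.
trailing=$(echo "$f" | grep -Po '[0-9]+_\K[a-z]+(?i)(?=_[0-9a-z]*)' || true)

echo "leading:  '$leading'"
echo "trailing: '$trailing'"
```

The `|| true` only keeps a non-matching grep from aborting a script that runs under `set -e`.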

The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.
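On the concatenation question itself: cat reads files and prints their contents, so cat $name '.jpg' would try to open files with those names. In the shell, strings are joined simply by writing them next to each other, as in this minimal sketch (the variable names other than name are made up):

```shell
name='somename'               # the value used in the question

with_ext="${name}.jpg"        # braces mark where the variable name ends
prefixed="img_${name}.jpg"    # text can go on either side

echo "$with_ext"
echo "$prefixed"
```

The braces are optional before a dot, but they are needed when the following character could be part of a variable name, as in "${name}_small".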
