Shell script: how to extract a string using regular expressions

Question

I am new to shell scripting. I want to send an HTTP request using curl and then extract part of the response using regular expressions. For example, how can I extract a domain name from an HTTP response? (The example is for learning purposes only.)

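#!/bin/bash
# Fetch the page and keep only the lines that look like they contain the domain
name=$(curl google.com | grep "www..*com")
echo "domain name is"
# Quote the variable so the whitespace in the response is preserved
echo "$name"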

Recommended answer

Use a bash regular expression:

re="http://([^/]+)/"
if [[ $name =~ $re ]]; then echo "${BASH_REMATCH[1]}"; fi

Edit: the OP asked for an explanation of the syntax. Regular expression syntax is a large topic that I can't cover in full here, but I will try to explain enough to make the example understandable.

re="http://([^/]+)/"

This is the regular expression, stored in a bash variable named re. It describes what you want the input string to match and which substring you want to extract from it. Breaking it down:

  • http:// is just a literal string: the input string must contain this substring for the regular expression to match.
  • [] Normally, square brackets say "match any one of the characters within the brackets", so c[ao]t matches both "cat" and "cot". A ^ as the first character inside the [] inverts this to "match any character except those within the square brackets". So in this case [^/] matches any single character other than "/".
  • The square bracket expression only matches one character. Adding a + after it says "match 1 or more of the preceding sub-expression". So [^/]+ matches a run of 1 or more characters, none of which is "/".
  • Putting () parentheses around a subexpression says that you want to save whatever matched that subexpression for later processing. If the language you are using supports this, it will provide some mechanism to retrieve these submatches. For bash, it is the BASH_REMATCH array.
  • Finally, we do an exact match on "/" to make sure we match all the way to the end of the fully qualified domain name and the "/" that follows it.
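
The pieces described above can be tried out one at a time. The following is only a minimal sketch, assuming bash 3.2 or later; the sample strings and the variable names url, word and host are made up for illustration and do not come from the question.

```shell
#!/usr/bin/env bash
# Hypothetical sample input; in the question this text would come from curl.
url='<A HREF="http://www.google.com/">here</A>'

# c[ao]t matches "cat" and "cot" but not "cut".
for word in cat cot cut; do
    if [[ $word =~ c[ao]t ]]; then
        echo "$word matches c[ao]t"
    fi
done

# [^/]+ matches one or more characters that are not "/";
# the parentheses capture that run so it can be read back afterwards.
re='http://([^/]+)/'
host=''
if [[ $url =~ $re ]]; then
    host=${BASH_REMATCH[1]}
fi
echo "$host"
```

Note that the pattern is not anchored, so it is allowed to match anywhere inside the input string, which is why the surrounding HTML does not get in the way here.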

Next, we have to test the input string against the regular expression to see if it matches. We can use a bash conditional to do that:

if [[ $name =~ $re ]]; then
    echo "${BASH_REMATCH[1]}"
fi

In bash, [[ ]] denotes an extended conditional test, which may contain the =~ regular expression operator. In this case we test whether the input string $name matches the regular expression $re. If it does match, then because of the way the regular expression is constructed, we are guaranteed to have a submatch (from the parentheses ()), and we can access it using the BASH_REMATCH array:

  • Element 0 of this array, ${BASH_REMATCH[0]}, is the entire string matched by the regular expression, i.e. "http://www.google.com/".
  • Subsequent elements hold the results of the submatches. A regular expression can contain several () submatches, and the BASH_REMATCH elements correspond to them in order. So in this case ${BASH_REMATCH[1]} will contain "www.google.com", which I think is the string you want.
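
To see how the array elements line up with the parentheses, here is a small sketch with two capture groups. It assumes bash 3.2 or later; the input string, the second group and the variable names whole, host and path are hypothetical additions, not part of the original example.

```shell
#!/usr/bin/env bash
# Hypothetical input string, chosen only for illustration.
input='see http://www.google.com/search for details'

# Two capture groups: the host name, then the path up to the next space.
re='http://([^/]+)/([^ ]*)'

if [[ $input =~ $re ]]; then
    whole=${BASH_REMATCH[0]}   # everything the regular expression matched
    host=${BASH_REMATCH[1]}    # first ( ) group
    path=${BASH_REMATCH[2]}    # second ( ) group
    echo "whole: $whole"
    echo "host:  $host"
    echo "path:  $path"
fi
```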

Note that the contents of the BASH_REMATCH array only apply to the last time the regular expression =~ operator was used. So if you go on to do more regular expression matches, you must save the contents you need from this array each time.
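
A short sketch of that point, assuming bash 3.2 or later: each successful match overwrites BASH_REMATCH, so the value is copied into an ordinary variable before the next match runs. The second pattern and the names re_host, re_year, host and year are invented for the illustration.

```shell
#!/usr/bin/env bash
re_host='http://([^/]+)/'
re_year='([0-9]+)'

host=''
year=''

if [[ 'http://www.google.com/' =~ $re_host ]]; then
    host=${BASH_REMATCH[1]}    # save it now; the next match replaces the array
fi

if [[ 'released in 2009' =~ $re_year ]]; then
    year=${BASH_REMATCH[1]}
fi

echo "host=$host year=$year"
# The array now reflects only the most recent match.
echo "BASH_REMATCH[1] is now: ${BASH_REMATCH[1]}"
```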

This may seem like a lengthy description, but I have glossed over several of the intricacies of regular expressions. They can be quite powerful, and I believe they perform reasonably well, but the syntax is complex. Regular expression implementations also vary, so different languages support different features and may have subtle differences in syntax. In particular, escaping characters within a regular expression can be a thorny issue, especially when those characters also have a special meaning in the host language.
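
One common instance of the escaping issue is the dot. The sketch below, which assumes bash 3.2 or later and uses made-up names (lookalike, loose, strict), shows that an unescaped "." matches any character, while "\." matches only a literal dot.

```shell
#!/usr/bin/env bash
# Hypothetical test string in which the dots have been replaced by letters.
lookalike='wwwXgoogleYcom'

loose='www.google.com'      # "." matches any single character
strict='www\.google\.com'   # "\." matches only a literal dot

if [[ $lookalike =~ $loose ]]; then loose_result='match'; else loose_result='no match'; fi
if [[ $lookalike =~ $strict ]]; then strict_result='match'; else strict_result='no match'; fi

echo "loose pattern:  $loose_result"
echo "strict pattern: $strict_result"
```

Single quotes are used for the patterns so that the shell leaves the backslashes alone and passes them through to the regular expression.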

Note that instead of setting the $re variable on a separate line and referring to it in the condition, you can put the regular expression directly into the condition. However, bash 3.2 changed the rules about quoting such literal regular expressions: from that version on, a quoted pattern is matched as a plain string rather than as a regular expression. Putting the regular expression in a separate variable and expanding it unquoted is a straightforward way around this, so that the condition works as expected in all bash versions that support the =~ match operator.
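
The effect of the quoting rule can be checked with a small experiment. This sketch assumes bash 3.2 or later with default settings (in particular, the compat31 shell option is not enabled); the variable names unquoted, quoted and inline are hypothetical.

```shell
#!/usr/bin/env bash
url='http://www.google.com/'
re='http://([^/]+)/'

# Unquoted variable: its contents are interpreted as a regular expression.
if [[ $url =~ $re ]]; then unquoted='match'; else unquoted='no match'; fi

# Quoted variable: its contents are compared as a literal string,
# and the text "([^/]+)" does not literally appear in the URL.
if [[ $url =~ "$re" ]]; then quoted='match'; else quoted='no match'; fi

# Pattern written directly in the condition, left unquoted.
inline=''
if [[ $url =~ http://([^/]+)/ ]]; then inline=${BASH_REMATCH[1]}; fi

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
echo "inline:   $inline"
```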
