Shell script: how to extract a string using regular expressions


Question

I am new to shell scripts. I want to send an HTTP request using curl and then extract part of the response using regular expressions. For example, how can I extract a domain name from an HTTP response? (The example is for learning purposes only.)

#!/bin/bash
name=$(curl google.com | grep "www\..*com")
echo "domain name is"
echo $name


Recommended answer

Use bash regular expressions:

re="http://([^/]+)/"
if [[ $name =~ $re ]]; then echo "${BASH_REMATCH[1]}"; fi

Edit: the OP asked for an explanation of the syntax. Regular expression syntax is a large topic that I can't explain in full here, but I will try to explain enough to make the example understandable.

re="http://([^/]+)/"

This is the regular expression, stored in a bash variable named re. It describes what the input string must match, and which substring you want to extract from it. Breaking it down:


  • http:// is just a literal string. The input must contain this substring for the regular expression to match.
  • [] Square brackets normally mean "match any one character within the brackets", so c[ao]t matches both "cat" and "cot". A ^ as the first character inside the brackets inverts this to "match any one character except those within the brackets". So in this case [^/] matches any character apart from "/".
  • A bracket expression matches exactly one character. Adding a + after it says "match 1 or more of the preceding sub-expression", so [^/]+ matches a run of one or more characters, none of which is "/".
  • Putting () parentheses around a subexpression says that you want to save whatever matched it for later processing. A language that supports this provides some mechanism to retrieve these submatches. In bash, it is the BASH_REMATCH array.
  • Finally, we match a literal "/" to make sure we match all the way to the end of the fully qualified domain name and the "/" that follows it.
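The pieces described above can be tried one at a time. The following is a minimal sketch, not part of the original answer; the helper name `matches` and the sample inputs are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints "yes" if $1 matches the regex in $2, else "no".
matches() {
    local input=$1 re=$2
    if [[ $input =~ $re ]]; then echo yes; else echo no; fi
}

matches "cat" 'c[ao]t'                                # yes: "a" is in the brackets
matches "cut" 'c[ao]t'                                # no:  "u" is not
matches "a/b" '^[^/]+$'                               # no:  the input contains a "/"
matches "http://www.google.com/" 'http://([^/]+)/'    # yes
matches "http://www.google.com"  'http://([^/]+)/'    # no:  the closing "/" is missing
```

The last two calls show why the trailing "/" in the pattern matters: without it in the input, the expression as written does not match at all.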

Next, we have to test the input string against the regular expression to see if it matches. We can use a bash conditional to do that:

if [[ $name =~ $re ]]; then
    echo "${BASH_REMATCH[1]}"
fi

In bash, [[ ]] is an extended conditional test, and it may contain the =~ regular expression operator. Here we test whether the input string $name matches the regular expression $re. If it does, then because of how the regular expression is constructed, we are guaranteed to have a submatch (from the parentheses), which we can read from the BASH_REMATCH array:


  • Element 0 of this array, ${BASH_REMATCH[0]}, is the entire string matched by the regular expression, i.e. "http://www.google.com/".
  • Subsequent elements hold the submatches. A regular expression can contain several () groups, and the BASH_REMATCH elements correspond to them in order. So in this case ${BASH_REMATCH[1]} contains "www.google.com", which I think is the string you want.
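To see how the array elements line up with the groups, here is a small sketch. It is a variation on the answer's pattern, not the original: a second group around the scheme has been added, and the sample line is an invented stand-in for one line of curl output.

```bash
#!/usr/bin/env bash
# Assumed sample input; real curl output will differ.
line='<A HREF="http://www.google.com/">here</A>'
re='(https?)://([^/]+)/'

if [[ $line =~ $re ]]; then
    whole=${BASH_REMATCH[0]}    # entire matched text
    scheme=${BASH_REMATCH[1]}   # first  (...) group
    host=${BASH_REMATCH[2]}     # second (...) group
    echo "whole:  $whole"       # http://www.google.com/
    echo "scheme: $scheme"      # http
    echo "host:   $host"        # www.google.com
fi
```

Groups are numbered by the position of their opening parenthesis, left to right.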

Note that the contents of the BASH_REMATCH array only reflect the most recent use of the =~ operator. So if you go on to do more regular expression matches, you must save what you need from this array each time.
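A short sketch of that habit, assuming two matches in a row. The variable names and the second pattern are illustrative only.

```bash
#!/usr/bin/env bash
url='http://www.google.com/search'
host_re='http://([^/]+)/'
tld_re='\.([a-z]+)$'        # last dot-separated label of the host

if [[ $url =~ $host_re ]]; then
    host=${BASH_REMATCH[1]}     # copy it now; the next =~ replaces the array
fi

if [[ $host =~ $tld_re ]]; then
    tld=${BASH_REMATCH[1]}      # BASH_REMATCH now describes this second match
fi

echo "host: $host"              # www.google.com
echo "tld:  $tld"               # com
```

Because host was copied into its own variable, it is still available after the second match has overwritten BASH_REMATCH.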

This may seem like a lengthy description, but I have glossed over several of the intricacies of regular expressions. They are powerful and, I believe, perform decently, but the syntax is complex. Implementations also vary, so different languages support different features and may differ subtly in syntax. In particular, escaping characters within a regular expression can be a thorny issue, especially when those characters would otherwise have a different meaning in the given language.
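One common instance of the escaping issue is the dot. The sketch below is an illustration with invented inputs, not part of the original answer: it compares an unescaped "." with an escaped "\." when the pattern is kept in a single-quoted bash variable.

```bash
#!/usr/bin/env bash
loose='www.google.com'       # unescaped "." matches any single character
strict='www\.google\.com'    # "\." matches only a literal dot

input='wwwXgoogleYcom'       # made-up string with no dots at all

if [[ $input =~ $loose ]];  then loose_result="match";  else loose_result="no match";  fi
if [[ $input =~ $strict ]]; then strict_result="match"; else strict_result="no match"; fi

echo "loose:  $loose_result"     # match
echo "strict: $strict_result"    # no match
```

Single quotes keep the backslashes intact in the variable, so the regular expression engine, not the shell, gets to interpret them.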

Note that instead of setting the re variable on a separate line and referring to it in the condition, you can put the regular expression directly into the condition. However, bash 3.2 changed the rules on whether such literal regular expressions should be quoted. Putting the regular expression in a separate variable is a straightforward way around this, so that the condition works as expected in all bash versions that support the =~ operator.
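The effect of quoting can be observed directly. This sketch assumes bash 3.2 or later running without any compatibility options; the input string is made up.

```bash
#!/usr/bin/env bash
name='see http://www.google.com/ for details'
re='http://([^/]+)/'

# Unquoted expansion: the variable's contents are treated as a regex.
if [[ $name =~ $re ]]; then
    via_variable=${BASH_REMATCH[1]}
fi

# Quoted expansion: the contents are matched as a literal string, so
# "(", "[" and "+" lose their special meaning and nothing matches here.
if [[ $name =~ "$re" ]]; then
    via_quoted="matched"
else
    via_quoted="no match"
fi

echo "unquoted: $via_variable"   # www.google.com
echo "quoted:   $via_quoted"     # no match
```

So the variable approach only helps if the expansion on the right-hand side of =~ is left unquoted.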
