Bash regex match with word boundary

Question

I would like to match the following expression in bash:

^.*(\b((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))\b).*$

Really all I want to know is whether one of the words of the string tested is one of the words described in this regex (720p, 1080p, brrip, ...). And there seems to be an issue with the word boundaries.

The test I use is [[ $name =~ $re ]] && echo "yes", where $name is any string and $re is my regular expression.

What am I missing?

Solution

\b is a PCRE extension; it isn't available in POSIX ERE (Extended Regular Expressions), which is the minimum syntax that the =~ operator in bash's [[ ]] is guaranteed to honor. (An individual operating system may have a libc that extends this syntax; those extensions will then be available on that operating system, but not on every platform where bash is supported.)
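One way to see the portability problem is to probe the local regex library directly. The following is only a sketch: probe_word_boundary is a hypothetical helper name, and the result it prints depends on the platform's libc, so no particular output should be expected.

```bash
#!/usr/bin/env bash
# Probe whether this platform's regex library happens to accept \b in an ERE.
# probe_word_boundary is a hypothetical helper name used only for illustration.
probe_word_boundary() {
  local re='\bfoo\b'
  # [[ =~ ]] returns non-zero both when there is no match and when the
  # pattern is rejected, so the else branch covers either case.
  if [[ "a foo b" =~ $re ]]; then
    printf '%s\n' 'honored'
  else
    printf '%s\n' 'not honored'
  fi
}

probe_word_boundary
```

Whatever this prints on one machine says nothing about another, which is why the rest of the answer sticks to plain ERE.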

As a baseline, the \b extension doesn't actually have very much expressive power -- you can write any PCRE that uses it as an equivalent ERE. Better, though, is to step back and question the underlying assumptions: when you say "word boundary", what do you really mean? If all you care about is that the match starts and ends either at whitespace or at the beginning or end of the string, then you don't need the \b operator at all:

(^|[[:space:]])((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))($|[[:space:]])
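As a minimal sketch of how the whitespace-delimited pattern above might be used, the regex can be stored in a variable and tested with [[ =~ ]]. The helper name has_tag and the sample file names are assumptions made for illustration; they do not come from the question.

```bash
#!/usr/bin/env bash
# Whitespace-delimited version of the pattern, kept in a variable so that
# bash does not treat any part of it as a quoted (literal) string.
re='(^|[[:space:]])((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))($|[[:space:]])'

# has_tag is a hypothetical helper name, not part of the original question.
has_tag() {
  [[ $1 =~ $re ]]
}

# A tag surrounded by spaces matches.
has_tag "Some Show 720p hdtv" && echo "yes"
# Dots are not whitespace, so this variant does not match.
has_tag "Some.Show.720p.hdtv" || echo "no"
```

Note that $re is left unquoted on the right-hand side of =~; quoting it would make bash compare it as a literal string instead of a regex.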

Note that I took out the initial ^.* and ending .*$, since those constructs are self-negating when doing an otherwise-unanchored match; the .* makes the ^ that immediately precedes it meaningless, and likewise the .* just before the final $.
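The redundancy of the leading ^.* and trailing .*$ can be checked with a small comparison. This sketch uses a shortened two-word pattern as a stand-in for the full one, and check is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Simplified stand-ins for the full pattern, chosen only for illustration.
re_plain='(720p|hdtv)'
re_padded='^.*(720p|hdtv).*$'

# check is a hypothetical helper: prints "match" or "no match"
# for a given name/regex pair.
check() {
  if [[ $1 =~ $2 ]]; then echo "match"; else echo "no match"; fi
}

# Both patterns should agree on every input, since =~ is unanchored anyway.
for name in "Some Show 720p" "hdtv capture" "nothing relevant"; do
  printf '%s: plain=%s padded=%s\n' \
    "$name" "$(check "$name" "$re_plain")" "$(check "$name" "$re_padded")"
done
```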


Now, if you want an exact equivalent to \b when placed immediately before a word character at the beginning of a sequence, then we get something more like:

(^|[^a-zA-Z0-9_])

...and, likewise, when immediately after a word character at the end of a sequence:

($|[^a-zA-Z0-9_])

Both of these are somewhat degenerate cases -- there are other situations where emulating the behavior of \b in ERE can be more complicated -- but they're the only situations your question appears to present.
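Putting the two boundary fragments around the word list gives something like the sketch below. It assumes ASCII word characters; the variable names start, end, words and the helper has_tag are illustrative, and the word list has been flattened slightly without changing what it accepts.

```bash
#!/usr/bin/env bash
# ERE stand-ins for \b, assuming ASCII word characters [a-zA-Z0-9_].
start='(^|[^a-zA-Z0-9_])'
end='($|[^a-zA-Z0-9_])'
words='(720p|1080p|((br)|(hd)|(bd)|(web)|(dvd))rip|(x|h)264|DVDscr|xvid|hdtv|ac3|s[0-9]{2}e[0-9]{2}|avi|mp4|mkv|eztv|YIFY)'
re="${start}${words}${end}"

# has_tag is a hypothetical helper name used only for this example.
has_tag() {
  [[ $1 =~ $re ]]
}

# Dots count as non-word characters, so dotted release names match.
has_tag "Some.Show.S01E02.720p.x264-GRP.mkv" && echo "yes"
```

Unlike \b, these fragments consume the neighbouring character, so the overall match includes it; the tag itself lands in BASH_REMATCH[2].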

Note that some implementations of \b would have better support for non-ASCII character sets, and thus be better described with [^[:alnum:]_] rather than [^a-zA-Z0-9_], but it's not well-defined here which implementation you're coming from or comparing against.
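If the locale-aware reading is the one wanted, the non-word class can be kept in a single variable so that switching between the two interpretations is a one-line change. This is a sketch with a shortened word list; the variable names are assumptions, and how [:alnum:] treats non-ASCII characters depends on the current locale.

```bash
#!/usr/bin/env bash
# Non-word class kept in one variable so it can be swapped in one place.
# [^[:alnum:]_] follows the current locale; [^a-zA-Z0-9_] is ASCII-only.
nonword='[^[:alnum:]_]'
words='(720p|1080p|hdtv|mkv)'   # shortened list, for illustration only
re="(^|${nonword})${words}(\$|${nonword})"

[[ "Some.Show.1080p.mkv" =~ $re ]] && echo "yes"
```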
