Matching word boundary with Bash regex

Question

I would like to match the following expression in bash:

^.*(\b((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))\b).*$

Really all I want to know is whether one of the words of the string tested is one of the words described in this regex (720p, 1080p, brrip, ...). And there seems to be an issue with the word boundaries.

The test I use is [[ $name =~ $re ]] && echo "yes", where $name is any string and $re is my regex.

What am I missing?

Recommended answer

\b is a PCRE extension; it isn't available in POSIX ERE (Extended Regular Expressions), which is the smallest possible set of syntax that the =~ operator in bash's [[ ]] will honor. (An individual operating system may have a libc which extends this syntax; in this case those extensions will be available on such operating systems, but not on all platforms where bash is supported).
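
Because support depends on the platform's regex library, a quick probe can show what a given machine does. This is only a sketch: backslash_b_supported is a made-up helper name, and a positive result says nothing about other systems the script may run on.

```shell
#!/usr/bin/env bash
# Probe whether this platform's regex library honors \b inside [[ =~ ]].
# backslash_b_supported is a hypothetical helper, not part of bash.
backslash_b_supported() {
  local re='\bfoo\b'
  # Exit status: 0 = matched, 1 = no match, 2 = regex rejected as invalid.
  [[ 'a foo b' =~ $re ]] 2>/dev/null
}

if backslash_b_supported; then
  printf '%s\n' 'this libc accepts \b, but the pattern is still not portable'
else
  printf '%s\n' 'no \b support here; use an ERE emulation instead'
fi
```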

As a baseline, the \b extension doesn't actually have very much expressive power -- a pattern whose only non-ERE feature is \b can be rewritten as an equivalent ERE. Better, though, is to step back and question the underlying assumptions: when you say "word boundary", what do you really mean? If all you care about is that the word is delimited on each side by whitespace or by the beginning or end of the string, then you don't need the \b operator at all:

(^|[[:space:]])((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))($|[[:space:]])

Note that I took out the initial ^.* and ending .*$, since those constructs are self-negating when doing an otherwise-unanchored match; the .* makes the ^ that immediately precedes it meaningless, and likewise the .* just before the final $.
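
As a minimal sketch of how the whitespace-delimited pattern behaves in a script (has_tag and the sample names are made up for illustration):

```shell
#!/usr/bin/env bash
# Whitespace-delimited pattern from the answer, stored in a variable so that
# [[ =~ ]] treats it as a regex rather than a quoted literal string.
re='(^|[[:space:]])((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))($|[[:space:]])'

# has_tag is a hypothetical helper: succeeds if any whitespace-delimited
# word of its argument is one of the listed tags.
has_tag() {
  [[ $1 =~ $re ]]
}

for name in "Some Movie 720p" "Some.Movie.720p.mkv" "holiday photos"; do
  if has_tag "$name"; then
    echo "yes: $name"
  else
    echo "no: $name"
  fi
done
```

Only the first name matches. The dot-separated name does not, because "." is not whitespace; that is the case the stricter emulation of \b is meant to cover.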

Now, if you want an exact equivalent to \b when placed immediately before a word character at the beginning of a sequence, then we get something more like:

(^|[^a-zA-Z0-9_])

...and, likewise, when immediately after a word character at the end of a sequence:

($|[^a-zA-Z0-9_])

Both of these are somewhat degenerate cases -- there are other situations where emulating the behavior of \b in ERE can be more complicated -- but they're the only situations your question appears to present.
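
Putting the two pieces together, a sketch might look like the following. The tag list is shortened from the original pattern for readability, and first_tag is a made-up helper name, so treat this as one possible arrangement.

```shell
#!/usr/bin/env bash
# ASCII-only emulation of \b on both sides of an alternation of tags.
# "tags" is a shortened, illustrative subset of the original pattern.
tags='720p|1080p|(br|hd|bd|web|dvd)rip|(x|h)264|xvid|hdtv|s[0-9]{2}e[0-9]{2}|mkv'
re="(^|[^a-zA-Z0-9_])(${tags})(\$|[^a-zA-Z0-9_])"

# first_tag is a hypothetical helper: prints the leftmost matching tag,
# or returns 1 if the string contains none.
first_tag() {
  [[ $1 =~ $re ]] || return 1
  # Group 1 is the leading boundary, group 2 is the tag itself.
  printf '%s\n' "${BASH_REMATCH[2]}"
}

first_tag "Some.Movie.720p.mkv"            # prints 720p
first_tag "Show.s01e02.hdtv"               # prints s01e02
first_tag "my720pvideo" || echo "no tag"   # 720p is embedded in a longer word
```

Unlike a true \b, the boundary groups here consume the neighboring character, which is fine for a yes/no test but matters if the whole match is reused.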

Note that some implementations of \b would have better support for non-ASCII character sets, and thus be better described with [^[:alnum:]_] rather than [^a-zA-Z0-9_], but it's not well-defined here which implementation you're coming from or comparing against.
