How to get the inverse of a regular expression?

Question

Let's say I have a regular expression that correctly finds all of the URLs in a text file:

(http://)([a-zA-Z0-9\/\.])*

If what I want is not the URLs but the inverse, that is, all of the text except the URLs, is there an easy modification that gets this?

Recommended answer

If for some reason you need a regex-only solution, try this:

((?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))|\A(?!http://[a-zA-Z0-9\/\.#?/%])).+?((?=http://[a-zA-Z0-9\/\.#?/%])|\Z)

I expanded the set of URL characters a little ([a-zA-Z0-9\/\.#?/%]) to include a few important ones, but it is by no means meant to be exact or exhaustive.
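
One caveat on portability: the lookbehind in this pattern contains a + quantifier, so it is variable-width. Engines such as .NET's accept that, but Python's re module, for one, rejects it with a "look-behind requires fixed-width pattern" error. For such engines, a minimal sketch of a different technique (splitting on the URL pattern instead of matching the gaps with a single regex) could look like the following. The names URL_PATTERN and non_url_text are hypothetical, and the character class is the same rough approximation used above, with the redundant escapes dropped.

```python
import re

# Same approximate URL character set as the answer; not an exhaustive URL grammar.
URL_PATTERN = re.compile(r"http://[a-zA-Z0-9/.#?%]+")

def non_url_text(text):
    """Return the non-empty pieces of text that lie between URL matches."""
    # The pattern has no capturing group, so split() discards the URLs themselves.
    return [piece for piece in URL_PATTERN.split(text) if piece]

sample = "see http://example.com/a and http://example.org/b?x for details"
print(non_url_text(sample))  # ['see ', ' and ', ' for details']
```

This is not a regex-only solution, so it only helps when the surrounding code can call a split function.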

The regex is a bit of a monster, so I'll try to break it down:

(?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))

The first portion matches the end of a URL. http://[a-zA-Z0-9\/\.#?/%]+ matches the URL itself, while (?=[^a-zA-Z0-9\/\.#?/%]) asserts that the URL must be followed by a non-URL character, so that we are sure we are at its end. A lookahead is used so that the non-URL character is checked for but not consumed. The whole thing is wrapped in a lookbehind (?<=...) so that it acts as the starting boundary of the match, again without that portion becoming part of the match.
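
To see the lookahead part on its own, here is a small sketch in Python. It leaves out the lookbehind wrapper, since Python's re module does not accept a variable-width lookbehind. The names URL_CHARS and url_end are made up for this illustration.

```python
import re

URL_CHARS = r"a-zA-Z0-9/.#?%"
# URL body, then a lookahead that requires one non-URL character to follow.
url_end = re.compile(rf"http://[{URL_CHARS}]+(?=[^{URL_CHARS}])")

text = "go to http://example.com/x, then stop"
m = url_end.search(text)
matched = m.group(0)        # the URL only; the comma is not part of the match
next_char = text[m.end()]   # the character the lookahead inspected
print(repr(matched), repr(next_char))  # 'http://example.com/x' ','
```

Note that a URL sitting at the very end of the input has no following character, so this fragment alone does not match it.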

We also want to match a non-URL at the beginning of the file. \A(?!http://[a-zA-Z0-9\/\.#?/%]) matches the beginning of the file (\A), followed by a negative lookahead to make sure there's no URL lurking at the start of the file. (This URL check is simpler than the first one because we only need the beginning of the URL, not the whole thing.)
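
A quick way to check this piece in isolation, sketched in Python with a hypothetical helper name:

```python
import re

# \A anchors at the start of the input; (?!...) then rejects a leading URL.
START_NOT_URL = re.compile(r"\A(?!http://[a-zA-Z0-9/.#?%])")

def starts_with_non_url(text):
    """True when the input does not begin with something that looks like a URL."""
    # The match is zero-width, but a match object is truthy regardless.
    return START_NOT_URL.search(text) is not None

print(starts_with_non_url("intro text http://example.com"))  # True
print(starts_with_non_url("http://example.com then text"))   # False
```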

Both of those checks are put in parentheses and OR'd together with the | character. After that, .+? matches the string we are trying to capture.

Then we come to ((?=http://[a-zA-Z0-9\/\.#?/%])|\Z). Here, we check for the beginning of a URL, once again with (?=http://[a-zA-Z0-9\/\.#?/%]). The end of the file is also a pretty good sign that we've reached the end of our match, so we look for that too, using \Z. As with the first big group, we wrap it in parentheses and OR the two possibilities together.
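
The interplay between the lazy .+? and this closing group can be sketched in Python as follows. This is a simplified illustration, not the full pattern: it anchors with a bare \A in place of the first big group, and the helper name is hypothetical. It also assumes single-line input, because . does not match a newline unless the engine's dot-all mode is enabled.

```python
import re

# Lazy .+? stops as soon as the lookahead sees "http://" plus one URL character,
# or when \Z reports the end of the input.
LEADING_TEXT = re.compile(r"\A(.+?)(?=http://[a-zA-Z0-9/.#?%]|\Z)")

def text_before_first_url(text):
    """Return the text from the start of the input up to the first URL, if any."""
    m = LEADING_TEXT.search(text)
    return m.group(1) if m else None

print(repr(text_before_first_url("intro http://example.com rest")))  # 'intro '
print(repr(text_before_first_url("no links here")))                  # 'no links here'
```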

The | symbol requires the parentheses because its precedence is very low, so you have to state the boundaries of the OR explicitly.
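
A small Python sketch of what goes wrong without the grouping (the patterns here are toy examples, not taken from the answer):

```python
import re

ungrouped = re.compile(r"\Acat|dog")    # parsed as (\Acat) OR (dog)
grouped = re.compile(r"\A(?:cat|dog)")  # \A applies to both alternatives

text = "hotdog"
print(ungrouped.search(text) is not None)  # True: "dog" may match anywhere
print(grouped.search(text) is not None)    # False: neither word is at the start
```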

This regex relies heavily on zero-width assertions (the \A and \Z anchors, and the lookaround groups). You should always understand a regex before you use it for anything serious or permanent (otherwise you might catch a case of Perl), so you might want to check out Start of String and End of String Anchors and Lookahead and Lookbehind Zero-Width Assertions.

Corrections are welcome, of course!
