Negative lookbehind in a regex with an optional prefix


Problem description

We are using the following regex to recognize URLs (derived from this gist by Jim Gruber). It is executed in Scala using scala.util.matching, which in turn uses java.util.regex:

(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»""‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b/?(?!@)))

This version has escaped forward slashes, for Rubular:

(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»""‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@))))

Previously the front end sent only plaintext to the back end, but now users are allowed to create anchor tags for URLs. The back end therefore needs to recognize URLs except for those that are already inside anchor tags. I initially tried to accomplish this with a negative lookbehind that ignores URLs with an href=" prefix:

(?i)\b((?<!href=")((?:https?: ... etc

The problem is that our URL regex is very liberal: it recognizes http://www.google.com, www.google.com, and google.com. Given

 <a href="http://www.google.com">Google</a>

the negative lookbehind rejects http://www.google.com, but the regex then still recognizes www.google.com, because the lookbehind is only checked at the position where a match attempt starts. I'm wondering if there's a succinct way to tell the regex "ignore www.google.com and google.com if they are substrings of an ignored http(s)://www.google.com".
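The behaviour described above can be reproduced with a much smaller pattern. The sketch below is only an illustration: simpleUrl is a hypothetical, heavily simplified stand-in for the full URL regex, and LookbehindDemo and urls are names made up for this example.

```scala
import scala.util.matching.Regex

object LookbehindDemo {
  // Hypothetical, simplified stand-in for the full URL regex:
  // an optional scheme followed by dot-separated labels.
  // The negative lookbehind rejects a match that starts right after href="
  val simpleUrl: Regex =
    """(?i)\b(?<!href=")((?:https?://)?[a-z0-9]+(?:\.[a-z0-9]+)+)""".r

  // Returns the text captured by group 1 for every match.
  def urls(text: String): List[String] =
    simpleUrl.findAllMatchIn(text).map(_.group(1)).toList
}
```

For the anchor tag shown above, the match attempt starting at http fails the lookbehind, but the engine then moves on and starts a new attempt at www, where the preceding characters are no longer href=", so www.google.com is still reported.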

At present I'm using a filter on the URL regex matches (the code is in Scala). It also ignores URLs in link text (<a href="http://www.google.com">www.google.com</a>) by ignoring URLs with a > prefix and a </a> suffix. I'd rather stick with the filter if doing this in the regex would make an already complicated regex even more unreadable.

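// Keep only matches that are neither an href value nor the text of an anchor tag.
urlPattern.findAllMatchIn(text).toList.filter { m =>
  // Group 1 is null when it did not participate in the match, so check it first.
  Option(m.group(1)).isDefined && {
    val start = m.start(1)
    val end = m.end(1)
    // href=" is 6 characters long.
    val isHref = start >= 6 && text.substring(start - 6, start) == "href=\""
    // </a> is 4 characters long, so compare the 4 characters after the match.
    val isAnchor = start >= 1 && end + 4 <= text.length &&
      text.substring(start - 1, start) == ">" &&
      text.substring(end, end + 4) == "</a>"
    !(isHref || isAnchor)
  }
}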

Recommended answer

<a href=\S+|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»""‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))

<a href=(?:(?!<\/a>).)*<\/a>|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»""‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))

Try this. What it essentially does is:

  1. Consumes every href link so that it cannot be matched later.

  2. Does not capture it, so it never appears in the groups.

  3. Processes the rest as before.
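A minimal sketch of how this consume-and-skip pattern might be used from Scala follows. It is an illustration under assumptions, not the exact regex above: simpleUrl is a hypothetical simplified stand-in for the full URL pattern, and SkipAnchorsDemo and urls are made-up names. The anchor branch is the second form shown in the answer, which consumes everything up to the closing </a>.

```scala
import scala.util.matching.Regex

object SkipAnchorsDemo {
  // Hypothetical, simplified stand-in for the full URL regex (capture group 1).
  private val simpleUrl = """((?:https?://)?[a-z0-9]+(?:\.[a-z0-9]+)+)"""

  // First alternative: a whole anchor element, consumed but not captured.
  // Second alternative: a URL outside any anchor, captured in group 1.
  val pattern: Regex =
    ("""(?i)<a href=(?:(?!</a>).)*</a>|\b""" + simpleUrl).r

  // group(1) is null for matches that took the anchor branch,
  // so wrapping it in Option drops those matches.
  def urls(text: String): List[String] =
    pattern.findAllMatchIn(text).flatMap(m => Option(m.group(1))).toList
}
```

Because the anchor alternative is tried first and consumes the whole element, neither the href value nor the link text is ever offered to the URL alternative. Note that . does not match line breaks by default, so an anchor spanning several lines would need the DOTALL flag (?s) in this sketch.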

See the demo:

http://regex101.com/r/vR4fY4/17
