Javascript regex: Find all URLs outside <a> tags - Nested Tags

Published: 2022/1/2 8:25:22 · Tags: javascript html regex hyperlink nested


Problem description


I have built this regex code:

((https?|ftps?)://[^"<\s]+)(?![^<>]*?>|[^<>]*?</)

The first group captures every link in the HTML. The second part is a negative lookahead that rejects any URL sitting inside a tag as an attribute value, or inside an element as content.
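To make that behaviour concrete, here is a minimal sketch that runs the pattern over a small piece of markup. The sample HTML, the `.example` host names and the variable names are made up for illustration. The sketch also assumes the character class is meant to be `[^"<\s]` and escapes `/` so the pattern is a valid JavaScript regex literal.

```javascript
// Pattern from the question, assuming the class is [^"<\s];
// "/" is escaped because this is a JavaScript regex literal.
const anyTagPattern = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)/g;

// Hypothetical sample markup, one case per line.
const anyTagSample = [
  'See http://outside.example/one <br>',          // plain text, followed by an opening tag
  '<a href="http://attr.example/two">label</a>',  // URL as an attribute value
  '<b>http://inbold.example/three</b>',           // URL as content of a non-<a> element
  '<a href="#">http://inlink.example/four</a>',   // URL as content of an <a> element
].join('\n');

const anyTagMatches = anyTagSample.match(anyTagPattern) || [];
console.log(anyTagMatches); // [ 'http://outside.example/one' ]
```

Only the URL that is followed by an opening tag survives. The URL inside <b> is rejected as well, which is the over-exclusion the question is trying to remove.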

I would like only <a> tags to be excluded, so one solution could be to change just the last alternative to:
[^<>]*?</a>
But that breaks as soon as there are nested tags, for example <b></b> inside <a>.
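The nested-tag problem can be reproduced with a short sketch. The markup and names below are hypothetical, and the character class is again assumed to be `[^"<\s]`.

```javascript
// Variant whose last alternative only looks for a directly following </a>.
const closeAnchorPattern = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/a>)/g;

// Hypothetical sample markup, one case per line.
const closeAnchorSample = [
  '<b>http://inbold.example/one</b>',                    // outside <a>: should match
  '<a href="#">http://inlink.example/two</a>',           // directly inside <a>: should not match
  '<a href="#"><b>http://nested.example/three</b></a>',  // nested inside <a>: should not match
].join('\n');

const closeAnchorMatches = closeAnchorSample.match(closeAnchorPattern) || [];
console.log(closeAnchorMatches);
// [ 'http://inbold.example/one', 'http://nested.example/three' ]
```

The third URL is reported even though it sits inside a link: `[^<>]*?` cannot step over the `<` of `</b>`, so the lookahead never reaches `</a>`.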

Here is the example I am working on: https://regex101.com/r/lM3hC5/6 (should be 10 matches).

Negative lookahead is still tricky for me. I thought that the following should work, but it doesn't:
(?!<a.+?</a>)
https://regex101.com/r/hT1cG5/1
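A small experiment suggests why this attempt has no effect. A lookahead is tested only at the position where the preceding part of the pattern stopped, so here it asks whether the text immediately after the URL starts with `<a`, not whether the URL is enclosed by an <a> element. The sketch below appends the lookahead to the URL group, which is an assumption about how it was used; the markup and names are hypothetical.

```javascript
// URL group followed by the attempted lookahead (assumed placement).
const anchorLookaheadPattern = /((https?|ftps?):\/\/[^"<\s]+)(?!<a.+?<\/a>)/g;

// URL inside a link: the text right after it is "</a>", which does not
// start with "<a", so the lookahead does not reject it.
const insideLink = '<a href="#">http://inlink.example/x</a>';
const insideLinkMatches = insideLink.match(anchorLookaheadPattern) || [];
console.log(insideLinkMatches); // [ 'http://inlink.example/x' ]

// URL directly followed by a link: the lookahead rejects the full URL,
// but the engine backtracks one character and matches a truncated URL.
const beforeLink = 'http://before.example/y<a href="#">z</a>';
const beforeLinkMatches = beforeLink.match(anchorLookaheadPattern) || [];
console.log(beforeLinkMatches); // [ 'http://before.example/' ]
```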

These are the most recent discussions that helped me:

Regex replace text outside html tags

Regex replace text but exclude when text is between specific tag

Solution
It turned out that probably the best solution is the following:
((https?|ftps?)://[^"<\s]+)(?![^<>]*>|[^"]*?</a)
It looks as though the negative lookahead works properly only if it starts with a quantified pattern rather than a literal string. In that case, all we can practically do is rely on backtracking.

Again, we just want to make sure that nothing inside HTML tags (that is, attribute values) gets touched. Then we backtrack from </a to the first " symbol (it is not a valid URL character, whereas the < and > symbols do occur when tags are nested).
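The following sketch tries that pattern on markup with a nested tag inside a link. As before, the sample HTML and the names are invented for illustration, the character class is assumed to be `[^"<\s]`, and `/` is escaped for the JavaScript literal.

```javascript
// Pattern from the answer: reject URLs inside a tag (first alternative)
// or followed by </a with no double quote in between (second alternative).
const outsideAnchorPattern = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical sample markup, one case per line.
const outsideAnchorSample = [
  'See http://outside.example/one <br>',
  '<b>http://inbold.example/two</b>',
  '<a href="http://attr.example/three" title="t"><b>http://nested.example/four</b></a>',
  '<p class="x">http://para.example/five</p>',
].join('\n');

const outsideAnchorMatches = outsideAnchorSample.match(outsideAnchorPattern) || [];
console.log(outsideAnchorMatches);
// [ 'http://outside.example/one',
//   'http://inbold.example/two',
//   'http://para.example/five' ]
```

The first two URLs are kept only because the quoted `href` of the following link stops `[^"]*?` before it can reach `</a`, so the result depends on quotes appearing in the right places.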

Nested tags inside <a> tags are now handled properly too. Of course, the code is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful about the following:

placing quotes inside <a> tags;

using this algorithm on <a> tags without any attributes (placeholders), which is best avoided;

using multiple nested tags or lines, which you may need to avoid unless the URL inside the <a> tag comes after a double quote.
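Two of these caveats can be reproduced with a short sketch. It reflects one reading of the list above, uses hypothetical markup and names, and again assumes the character class `[^"<\s]`.

```javascript
const caveatPattern = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Caveat: an <a> tag without attributes. No double quote stands between the
// outside URL and the later </a, so the URL is wrongly rejected.
const placeholderCase = 'http://before.example/one and <a>placeholder</a>';
const placeholderMatches = placeholderCase.match(caveatPattern) || [];
console.log(placeholderMatches); // []

// Caveat: quotes inside an <a> tag. The quote after the URL stops the scan
// before it reaches </a, so the URL inside the link is wrongly reported.
const quotedCase = '<a href="#">http://inlink.example/two "quoted"</a>';
const quotedMatches = quotedCase.match(caveatPattern) || [];
console.log(quotedMatches); // [ 'http://inlink.example/two' ]
```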

Here is a good, messy example (the last match should not be found, but it is):

https://regex101.com/r/pC0jR7/2

It is a pity that this lookahead does not work: (?!<a.*?</a>)
