JavaScript regex: Find all URLs outside <a> tags - Nested Tags
I have built this regex:
((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)
The first group captures every link in the HTML. The second part is a negative lookahead that rejects any link sitting inside a tag as an attribute value or between tags as content.
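As a quick illustration of what that pattern accepts and rejects, here is a minimal sketch. The sample markup and the variable names are made up for this example and are not taken from the regex101 link.

```javascript
// Pattern from the question: group 1 is the URL; the negative lookahead
// rejects a URL that is followed by "...>" (inside a tag) or "...</" (tag content).
const urlOutsideTags = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/)/g;

// Hypothetical sample markup (assumption, for illustration only).
const sampleHtml =
  'See http://example.com/a and ' +
  '<a href="http://example.com/b">http://example.com/c</a> ' +
  '<b>http://example.com/d</b>';

// With the g flag, match() returns the full matches (or null if there are none).
const found = sampleHtml.match(urlOutsideTags) || [];
console.log(found); // [ 'http://example.com/a' ]
```

Only the bare URL in the text survives: the href value, the link text, and the URL inside <b> are all rejected by the lookahead.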
I would like only <a> tags to be excluded, so the solution could be to change only the last term to:
[^<>]*?<\/a>
But now there is a problem with nested tags, for example <b></b> inside <a>.
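The nested-tag problem can be reproduced with a small sketch. The markup below is a hypothetical sample, and the variable names are chosen only for this example.

```javascript
// Same pattern as before, but the last alternative now only looks for "</a>".
const onlyAnchors = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*?>|[^<>]*?<\/a>)/g;

// Hypothetical sample markup (assumption, for illustration only).
const nestedHtml =
  '<b>http://example.com/d</b> ' +
  '<a href="http://example.com/b">http://example.com/c</a> ' +
  '<a href="http://example.com/e"><b>http://example.com/f</b></a>';

const anchorMatches = nestedHtml.match(onlyAnchors) || [];
console.log(anchorMatches); // [ 'http://example.com/d', 'http://example.com/f' ]
```

The URL inside the standalone <b> is now found as intended, but so is the one in <b> nested inside <a>: [^<>]*? cannot step over the "<" of "</b>", so the lookahead never reaches "</a>".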
Here is the example I am working on: https://regex101.com/r/lM3hC5/6 (should be 10 matches).
Negative lookahead is still tricky for me. I thought the following should work, but it doesn't:
(?!<a.+?<\/a>)
https://regex101.com/r/hT1cG5/1
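One likely reason can be shown with a short sketch. Where the lookahead was placed in the linked attempt is not visible here, so putting it in front of the URL group is an assumption, as is the sample markup. A lookahead is tested at the current position only, and the text at a URL's position starts with "http", never with "<a", so the negative lookahead always passes.

```javascript
// Assumed placement: the lookahead sits directly before the URL group.
const withLookahead = /(?!<a.+?<\/a>)((https?|ftps?):\/\/[^"<\s]+)/g;
// The same pattern without any lookahead, for comparison.
const withoutLookahead = /((https?|ftps?):\/\/[^"<\s]+)/g;

// Hypothetical sample markup (assumption, for illustration only).
const probeHtml =
  'http://example.com/a <a href="http://example.com/b">http://example.com/c</a>';

const guarded = probeHtml.match(withLookahead) || [];
const unguarded = probeHtml.match(withoutLookahead) || [];

// Both patterns return the same three URLs, so the lookahead filters nothing.
console.log(guarded);
console.log(guarded.join() === unguarded.join()); // true
```

The lookahead cannot tell that a position lies inside an <a> element that was opened earlier in the string; it only sees what comes next.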
These are the last discussions that helped me:
It turned out that the best solution is probably the following:
((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)
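It looks like the negative lookahead only works properly when it starts with a quantified character class rather than a literal string. A lookahead is tested at the position right after the URL, so a pattern that begins with the literal <a can only succeed if <a is the very next text, whereas [^"]*? can skip ahead to a later </a. In practice this means we can only work backwards from the closing tag.
Again, the first alternative, [^<>]*>, just makes sure that nothing inside HTML tags as attributes is messed up. The second, [^"]*?<\/a, works backwards from </a up to the first " symbol: a double quote is not a valid URL character here, whereas < and > have to be allowed because they occur with nested tags.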
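The behaviour described above can be checked with a minimal sketch. The markup is a hypothetical sample and the variable names exist only for this example.

```javascript
// Pattern from the answer: reject URLs inside a tag ("...>") and URLs that can
// reach "</a" without crossing a double quote.
const outsideAnchors = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Hypothetical sample markup (assumption, for illustration only).
const mixedHtml =
  '<b>http://example.com/d</b> ' +
  '<a href="http://example.com/e"><b>http://example.com/f</b></a> ' +
  'http://example.com/g';

const outside = mixedHtml.match(outsideAnchors) || [];
console.log(outside); // [ 'http://example.com/d', 'http://example.com/g' ]
```

The URL nested in <a><b>...</b></a> is now rejected, because [^"]*? may cross "</b>" on its way to "</a". The first URL is kept only because the quote in href=" stops the scan before it reaches any "</a".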
Now nested tags inside <a> tags are also handled properly. Of course, the code is not perfect, but it should work with almost any simple HTML markup. You just need to be a bit careful with the following:
- quotes placed within <a> tags;
- <a> tags without any attribute (placeholders), on which this algorithm should not be used;
- multiple nested tags or lines, which you may need to avoid unless the URL inside the <a> tag comes after the last double quote.
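Two of these caveats can be reproduced with a small sketch. Both markup samples are hypothetical and were written only to trigger the edge cases.

```javascript
// Pattern from the answer, repeated so the snippet runs on its own.
const anchorAware = /((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)/g;

// Caveat: an <a> without attributes. No double quote separates the plain URL
// from the later "</a", so the URL is wrongly rejected.
const placeholderHtml = 'http://example.com/x <a>placeholder</a>';
const missed = placeholderHtml.match(anchorAware) || [];
console.log(missed); // []

// Caveat: a double quote between the URL and "</a". The scan stops at the
// quote, so a URL that is inside <a> is wrongly reported.
const quotedHtml = '<a href="#">http://example.com/y <i class="icon"></i></a>';
const leaked = quotedHtml.match(anchorAware) || [];
console.log(leaked); // [ 'http://example.com/y' ]
```

Both failures come from the same design choice: the double quote is the only boundary the second alternative recognizes.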
Here is a very good and messy example (the last match should not be found but it is):
https://regex101.com/r/pC0jR7/2
It is a pity that this lookahead does not work: (?!<a.*?<\/a>)