Convert URLs in a string to links, except when they are inside an attribute of an HTML tag

Question

I am trying to convert all URLs in a textarea input ($_POST['content']) into links.

$content = preg_replace('!(\s|^)((https?://)+[a-z0-9_./?=&-]+)!i', ' <a href="$2" target="_blank">$2</a> ', nl2br($_POST['content'])." ");
$content = preg_replace('!(\s|^)((www\.)+[a-z0-9_./?=&-]+)!i', '<a target="_blank" href="http://$2"  target="_blank">$2</a> ', $content." ");

Targeted link formats: www.hello.com and http(s)://(www).hello.com

But this seems to break any iframe, image, or similar element.

What regex (or regexes) would ignore URLs inside HTML tags?

Note: I know I need two expressions: one to detect links without a protocol (like www.hello.com, where I need to prepend http://), and another to detect URLs with a protocol (where nothing needs to be prepended).

Recommended answer

Your code as it is should not cause much of a problem within iframes and similar tags, because there the URL is usually preceded by a " rather than by whitespace, which your pattern requires.

However, here is a different solution. It might not work 100% if you have a single < or > within HTML comments or something similar (I do not know whether this is a problem for you or not), but in any other case it should serve you well. It uses a negative lookahead to make sure that there is no closing > before the next opening < (because that would mean you are inside a tag).

$content = preg_replace('$(\s|^)(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$2" target="_blank">$2</a> ', $content." ");
$content = preg_replace('$(\s|^)(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="http://$2" target="_blank">$2</a> ', $content." ");

In case you are not familiar with this technique, here is a bit more elaboration.

(?!        # starts the lookahead assertion; now your pattern will only match, if this subpattern does not match
[^<>]      # any character that is neither < nor >; the > is not strictly necessary but might help for optimization
*          # arbitrarily many of those characters (but in a row; so not a single < or > in between)
>          # the closing >
)          # ends the lookahead subpattern
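
To make the effect of the lookahead concrete, here is a minimal sketch that wraps the first pattern from above in a function and runs it on a string where a URL appears both in plain text and inside an attribute value. The function name `linkify_outside_tags` and the sample input are assumptions made for illustration; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper; the pattern is the http(s) one shown above.
function linkify_outside_tags(string $html): string
{
    // (?![^<>]*>) rejects a candidate when the next angle bracket after
    // it is a closing ">", which means the URL sits inside a tag.
    $pattern = '$(\s|^)(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i';
    $replacement = ' <a href="$2" target="_blank">$2</a> ';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, $replacement, $html) ?? $html;
}

// The second URL is preceded by a space, so (\s|^) alone would not
// protect it; the lookahead is what keeps it untouched.
$input = 'See http://example.com/page and '
       . '<img src="a.png" alt="from http://example.com/a here">';

echo linkify_outside_tags($input), "\n";
```

Only the URL in the running text should end up wrapped in a link; the one inside the alt attribute is followed by a > before any <, so every attempt to match it is rejected.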

Note that I changed the regex delimiters, because I am now using ! within the regex.

Unless you still need the first subpattern (\s|^) for the URLs outside of tags, you can now remove it, too (and decrement the capture references in the replacement).

$content = preg_replace('$(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&-]+)(?![^<>]*>)$i', '<a href="http://$1" target="_blank">$1</a> ', $content." ");

And lastly: do you intend not to replace URLs that contain anchors at the end, e.g. www.hello.com/index.html#section1? If you missed this by accident, add # to your allowed URL characters:

$content = preg_replace('$(https?://[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', ' <a href="$1" target="_blank">$1</a> ', $content." ");
$content = preg_replace('$(www\.[a-z0-9_./?=&#-]+)(?![^<>]*>)$i', '<a href="http://$1" target="_blank">$1</a> ', $content." ");

EDIT: Also, what about + and %? A few other characters are also allowed to appear in a URL without being encoded. END OF EDIT
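
One thing to watch when running two separate passes without the (\s|^) prefix: the www pass can match again inside the link text that the http(s) pass has just produced, because that text is followed by </a> rather than by >. The sketch below is one possible way around this, not part of the original answer: it handles both forms in a single pass with preg_replace_callback. The function name `linkify` and the exact character class (the answer's set plus #, + and %) are assumptions.

```php
<?php
// Hypothetical helper: one pass for both http(s):// and bare www. URLs,
// so text inserted by an earlier replacement is never scanned again.
function linkify(string $html): string
{
    // Assumed character class: the answer's set extended with #, + and %.
    $pattern = '~(?:https?://|www\.)[a-z0-9_./?=&#+%-]+(?![^<>]*>)~i';

    $callback = function (array $m): string {
        $url = $m[0];
        // Prepend a scheme only when the match starts with "www.".
        $href = stripos($url, 'www.') === 0 ? 'http://' . $url : $url;
        return '<a href="' . $href . '" target="_blank">' . $url . '</a>';
    };

    // preg_replace_callback() returns null on error; fall back to the input.
    return preg_replace_callback($pattern, $callback, $html) ?? $html;
}

echo linkify('Visit www.hello.com/index.html#section1 or https://www.hello.com now'), "\n";
```

This still shares the limits of the lookahead approach: a URL that is already the text of an existing <a> element in the input would be wrapped a second time, and trailing punctuation such as a final dot is treated as part of the URL.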

I think this should do the trick for you. However, if you could provide an example that shows working and broken URLs (with the code you have), we could actually provide solutions that are tested to work for all of your cases.

One final thought. The proper solution would be to use a DOM parser. Then you could simply apply the regex you already have to text nodes only. However, your concern for the HTML structure is very restricted, and that makes your problem regular again (as long as you do not have unmatched '<' or '>' in HTML comments, JavaScript, or CSS on the page). If you do have those special cases, you should really look into a DOM parser. None of the solutions presented here (so far) will be safe in that case.
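
For completeness, here is a rough sketch of what the DOM-parser route could look like with PHP's bundled DOM extension. It is an illustration under assumptions, not a tested drop-in: the function name `linkify_with_dom`, the wrapper element id, and the URL character class are made up for this example, and character-encoding handling is left out (the sample input is plain ASCII).

```php
<?php
// Hypothetical helper: linkify URLs in text nodes only, so attributes,
// existing links, scripts and styles are never touched.
function linkify_with_dom(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings quiet
    // Wrap the fragment so its content can be located again afterwards.
    $doc->loadHTML('<div id="linkify-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = '//text()[not(ancestor::a) and not(ancestor::script)'
           . ' and not(ancestor::style)]';
    // Assumed URL pattern; the capture group keeps URLs in the split result.
    $pattern = '~((?:https?://|www\.)[a-z0-9_./?=&#+%-]+)~i';

    // Copy the node list first, because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue;                    // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {          // odd indexes hold the captured URLs
                $a = $doc->createElement('a');
                $href = stripos($part, 'www.') === 0 ? 'http://' . $part : $part;
                $a->setAttribute('href', $href);
                $a->setAttribute('target', '_blank');
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper element.
    $root = $xpath->query('//div[@id="linkify-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo linkify_with_dom('Docs at http://example.com/docs and <img src="http://example.com/a.png">'), "\n";
```

Because the pattern is applied to text nodes only, no lookahead is needed, and stray < or > characters in comments, scripts, or styles no longer matter. The trade-off is that the parser may normalize the markup it is given, so the output is not guaranteed to be byte-for-byte identical to the input outside the inserted links.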
