How to replace text URLs and exclude URLs in HTML tags?


Question





I need your help here.

I want to turn this:

sometext sometext http://www.somedomain.com/index.html sometext sometext

into:

sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext

I have managed it by using this regex:

preg_replace("#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#ie", "'<a href=\"$1\" target=\"_blank\">$1</a>$4'", $text);

The problem is that it also replaces the img URL, for example:

sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext

is turned into:

sometext sometext <img src="<a href="http://domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext

Please help.

Solution

A streamlined version of Gumbo's answer:

$html = <<< HTML
<html>
<body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body>
</html>
HTML;

Let's use an XPath expression that fetches only the text nodes that contain http://, https:// or ftp:// and that are not descendants of anchor elements.

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and (
        contains(.,"http://") or
        contains(.,"https://") or
        contains(.,"ftp://") )]'
);

The XPath above gives us a text node with the following data:

 and also another http://example.com with the latter being the
    only one that should be replaced. There is also images in this
    text, like 

Since PHP 5.3 we could also call PHP functions from inside the XPath expression (via DOMXPath::registerPhpFunctions) and use the regex pattern to select our nodes, instead of the three calls to contains().
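
As a rough illustration of that idea, the sketch below registers preg_match for use inside XPath and filters text nodes with it. The sample markup and the shortened scheme-only pattern are assumptions made for this example, not part of the answer above. The explicit "> 0" comparison is there because a bare number inside an XPath predicate is treated as a position test.

```php
<?php
// Assumed sample markup, for illustration only.
$html = '<html><body><p>plain http://example.com text '
      . '<a href="http://example.com/1">http://example.com/1</a>'
      . ' no url here</p></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

// Make PHP functions callable from XPath under the "php" prefix.
$xPath->registerNamespace('php', 'http://php.net/xpath');
$xPath->registerPhpFunctions('preg_match');

// preg_match() returns 1 on a match; compare with 0 to get a boolean.
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and
        php:functionString("preg_match", "~(?:https?|ftp)://~i", .) > 0]'
);

foreach ($texts as $text) {
    echo trim($text->data), PHP_EOL;
}
```

Only the text node outside the anchor that actually contains a URL is selected. The link text inside the anchor and the trailing text without a URL are skipped.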

Instead of splitting the text nodes apart in the standards-compliant way, we will use a document fragment and simply replace the entire text node with the fragment. Non-standard here only means that the method we use for this, appendXML(), is not part of the W3C specification of the DOM API.

foreach ($texts as $text) {
    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML(
        preg_replace(
            "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
            '<a href="$1">$1</a>',
            $text->data
        )
    );
    $text->parentNode->replaceChild($fragment, $text);
}
echo $dom->saveXML($dom->documentElement);

This then outputs:

<html><body>
<p>
    This is a text with a <a href="http://example.com/1">link</a>
    and another <a href="http://example.com/2">http://example.com/2</a>
    and also another <a href="http://example.com">http://example.com</a> with the latter being the
    only one that should be replaced. There is also images in this
    text, like <img src="http://example.com/foo"/> but these should
    not be replaced either. In fact, only URLs in text that is no
    a descendant of an anchor element should be converted to a link.
</p>
</body></html>
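
For comparison, the standards-compliant route mentioned earlier could look roughly like the sketch below. It builds the new nodes with createElement and createTextNode instead of appendXML, so text containing a bare & or < does not have to be well-formed XML first. linkifyTextNode() is a hypothetical helper name, and the simplified URL pattern and sample markup are assumptions for this example only. Unlike the pattern above, this one does not strip trailing punctuation from a URL.

```php
<?php
// Hypothetical helper (not from the original answer): replaces one text node
// with a sequence of text nodes and <a> elements built via standard DOM calls.
function linkifyTextNode(DOMText $node): void
{
    // Simplified pattern, assumed for illustration only.
    $pattern = '~((?:https?|ftp)://[^\s<>"\']+)~i';

    // With DELIM_CAPTURE the result alternates: text, URL, text, URL, ...
    $parts  = preg_split($pattern, $node->data, -1, PREG_SPLIT_DELIM_CAPTURE);
    $doc    = $node->ownerDocument;
    $parent = $node->parentNode;

    foreach ($parts as $i => $part) {
        if ($part === '') {
            continue; // nothing to insert for empty pieces
        }
        if ($i % 2 === 1) {
            // Odd indexes hold the captured URLs.
            $a = $doc->createElement('a');
            $a->setAttribute('href', $part);
            $a->appendChild($doc->createTextNode($part));
            $parent->insertBefore($a, $node);
        } else {
            $parent->insertBefore($doc->createTextNode($part), $node);
        }
    }
    $parent->removeChild($node);
}

// Assumed sample markup; note the ampersand inside the bare URL.
$dom = new DOMDocument;
$dom->loadHTML(
    '<html><body><p>see http://example.com/?a=1&amp;b=2 and '
    . '<a href="http://example.com/1">link</a> '
    . '<img src="http://example.com/foo"/></p></body></html>'
);
$xPath = new DOMXPath($dom);

// Copy the node list to an array before modifying the tree.
$nodes = iterator_to_array($xPath->query('/html/body//text()[not(ancestor::a)]'));
foreach ($nodes as $text) {
    linkifyTextNode($text);
}
echo $dom->saveHTML();
```

The trade-off is a little more code in exchange for not having to escape the text before it is parsed as XML.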
