Ignore HTML tags in preg_replace

Problem description

How do I ignore HTML tags in this preg_replace? I have a foreach loop for a search, so if someone searches for "apple span", the preg_replace also applies a span to the span tags and the HTML breaks:

preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);

Thanks in advance!

Solution

I think you should base your function on DOMDocument and DOMXPath rather than on regular expressions. Even though regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.
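To make the failure concrete, here is a minimal sketch of the situation described in the question. The sample sentence and the splitting of the search into the keywords "apple" and "span" are assumptions for illustration; only the preg_replace call itself comes from the question.

```php
<?php
// Assumed sample data: plain text and a search split into two keywords.
$str = 'I like apple pie';
$keywords = ['apple', 'span'];

foreach ($keywords as $keyword) {
    // Same replacement as in the question, applied once per keyword.
    $str = preg_replace(
        "/($keyword)/i",
        "<span class=\"search_hightlight\">$1</span>",
        $str
    );
}

// The second pass matches "span" inside the tags inserted by the first pass,
// so the markup is no longer well-formed.
echo $str, "\n";
```

After the first pass the text contains a highlight span around "apple"; the second pass then rewrites the tag names themselves, which is exactly the breakage the question reports.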

The general saying is: Don't parse HTML with regular expressions.

It's a good rule to keep in mind and, although as with any rule it does not always apply, it's worth making up your own mind about it.

XPath allows you to find all texts that contain the search terms, looking at text only and ignoring all XML elements.

Then you only need to wrap those texts in a <span> and you're done.
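As a rough illustration of that idea, the sketch below wraps matches that lie entirely inside a single text node. It is a simplification and not the code from this answer: the function name highlightInTextNodes is hypothetical, the search is case-sensitive, and matches that stretch across several tags are not handled. It assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Wraps every occurrence of $search found inside one text node in a <span>.
function highlightInTextNodes(DOMDocument $doc, string $search): void
{
    if ($search === '') {
        return; // an empty needle would match forever
    }
    $xp = new DOMXPath($doc);
    // Copy the result list first, because the tree is modified while wrapping.
    $textNodes = iterator_to_array($xp->query('//text()'));

    foreach ($textNodes as $text) {
        while (false !== $pos = mb_strpos($text->data, $search, 0, 'UTF-8')) {
            $match = $text->splitText($pos);                          // node starting at the match
            $text  = $match->splitText(mb_strlen($search, 'UTF-8')); // remainder after the match

            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $match->parentNode->replaceChild($span, $match);
            $span->appendChild($match);
        }
    }
}

$doc = new DOMDocument;
$doc->loadXML('<body>An apple and <b>another apple</b>.</body>');
highlightInTextNodes($doc, 'apple');
echo $doc->saveXML($doc->documentElement), "\n";
```

Because only text nodes are inspected, tag names and attributes are never touched, whatever the search term is.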

Edit: Finally some code ;)

First it makes use of XPath to locate the elements that contain the search text. My query looks like this; it might be written better, as I'm not a super XPath pro:

'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'

$search contains the text to search for and must not contain any " (quote) character (that would break the query; see "Cleaning/sanitizing xpath attributes" for a workaround if you need quotes).

This query returns all parent elements whose text nodes, put together, form a string that contains your search term.
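One common way around the quote restriction on $search is to build the XPath string literal with concat(). The helper below is a sketch of that technique; the name xpathLiteral is hypothetical and it is not taken from the linked workaround.

```php
<?php
// Hypothetical helper: turns an arbitrary PHP string into an XPath 1.0
// string expression, even if it contains both kinds of quotes.
function xpathLiteral(string $value): string
{
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    // Both quote types present: join the pieces between double quotes
    // with a single-quoted '"' inside concat().
    $parts = explode('"', $value);
    return 'concat("' . implode('", \'"\', "', $parts) . '")';
}

$search = 'say "it\'s" here';
$expression = '//*[contains(., ' . xpathLiteral($search) . ')]';
echo $expression, "\n";
```

The resulting expression can be passed to DOMXPath::query in place of the hand-concatenated string.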

As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.
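The actual TextRange class is only available through the link further down. As a rough idea of what such a class might look like, here is a hypothetical stand-in (named TextRangeSketch to avoid confusion with the original). It assumes that split() keeps the first part in the current object and returns the rest, which is how the routine below uses it, and it works with UTF-8 character offsets via mbstring.

```php
<?php
// Hypothetical stand-in for the answer's TextRange class; the real one is
// not reproduced here. Offsets are UTF-8 character offsets.
final class TextRangeSketch
{
    /** @var DOMText[] */
    private array $nodes = [];

    public function __construct(iterable $textNodes)
    {
        foreach ($textNodes as $node) {
            $this->nodes[] = $node;
        }
    }

    /** Concatenated text of all nodes, so string functions can search it. */
    public function __toString(): string
    {
        $text = '';
        foreach ($this->nodes as $node) {
            $text .= $node->data;
        }
        return $text;
    }

    /** @return DOMText[] */
    public function getNodes(): array
    {
        return $this->nodes;
    }

    /**
     * Keeps the first $offset characters in this range and returns the rest
     * as a new range, splitting one DOMText node in two where necessary.
     */
    public function split(int $offset): self
    {
        $head = [];
        $tail = [];
        foreach ($this->nodes as $node) {
            $length = mb_strlen($node->data, 'UTF-8');
            if ($tail === [] && $offset >= $length) {
                $head[] = $node;                      // entirely before the split point
                $offset -= $length;
            } elseif ($tail === [] && $offset > 0) {
                $tail[] = $node->splitText($offset);  // split point lies inside this node
                $head[] = $node;
            } else {
                $tail[] = $node;                      // at or after the split point
            }
        }
        $this->nodes = $head;
        return new self($tail);
    }
}
```

With this sketch, a string search on the range would need an explicit cast, for example mb_strpos((string) $range, $search, 0, 'UTF-8'), if strict types are enabled.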

This is the base skeleton of the routine:

$str = '...'; # some XML

$search = 'text that span';

printf("Searching for: (%d) '%s'\n", strlen($search), $search);

$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);

$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
    throw new Exception('Anchor element not found.');
}

// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
    throw new Exception('XPath failed.');
}

// process search results
foreach($r as $i => $node)
{   
    $textNodes = $xp->query('.//child::text()', $node);

    // extract $search textnode ranges, create fitting nodes if necessary
    $range = new TextRange($textNodes);        
    $ranges = array();
    while(FALSE !== $start = strpos($range, $search))
    {
        $base = $range->split($start);
        $range = $base->split(strlen($search));
        $ranges[] = $base;
    }

    // wrap each matching textnode
    foreach($ranges as $range)
    {
        foreach($range->getNodes() as $node)
        {
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $node = $node->parentNode->replaceChild($span, $node);
            $span->appendChild($node);
        }
    }
}

For my example XML:

<html>
    <body>
        This is some <span>text</span> that span across a page to search in.
    and more text that span</body>
</html>

It produces the following result:

<html>
    <body>
        This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
    and more <span class="search_hightlight">text that span</span></body>
</html>

This shows that the approach even finds text that is distributed across multiple tags. That is not easily possible with regular expressions.

You can find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class, which I have left out of the example in this answer).

It does not work properly on the viper codepad because that site uses an older LIBXML version. It works fine with my LIBXML version 20707. I created a related question about this issue: XPath query result order.

A note of warning: This example uses a binary string search (strpos) and the resulting byte offsets to split text nodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.

The example works anyway because its example data only uses US-ASCII, for which byte offsets and UTF-8 character offsets are the same.

For a real life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:

 while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
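The difference between the two offsets is easy to see with a single non-ASCII character in front of the match. The sample string below is an assumption for illustration, and the snippet requires the mbstring extension.

```php
<?php
// "\u{00DC}" is "Ü", which takes two bytes in UTF-8 but is one character.
$haystack = "\u{00DC}ber text that span";
$search   = 'text that span';

$byteOffset = strpos($haystack, $search);                 // counts bytes
$charOffset = mb_strpos($haystack, $search, 0, 'UTF-8');  // counts characters

printf("byte offset: %d, character offset: %d\n", $byteOffset, $charOffset);

// If the split offsets are character based, the length passed to the second
// split() call would presumably need the same treatment:
$charLength = mb_strlen($search, 'UTF-8');
```

Here the byte offset is one larger than the character offset, so splitting at the strpos result would cut the text node one character too late.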
