PHP DOM - stripping span tags, leaving their contents

Question

I'm looking to take this markup:

<span class="test">Some text that is <strong>bolded</strong> and contains a <a href="#">link</a>.</span>

and find the best method in PHP for stripping the span so that what is left is this:

Some text that is <strong>bolded</strong> and contains a <a href="#">link</a>.

I have read many of the other questions about parsing HTML with PHP DOM instead of regex, but I have been unable to figure out a way to strip the spans with PHP DOM while leaving the HTML contents intact. The ultimate goal is to strip all span tags from the document, leaving their contents. Can this be done with PHP DOM? Is there a method that performs better without falling back on string parsing in place of DOM parsing?

I've used regex to do so, without any issues thus far:

/<(\/)?(span)[^>]*>/i
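As a rough illustration of how this pattern behaves, the sketch below runs it through preg_replace() on two inputs. The sample strings are made up for this example; the second one has a literal ">" inside an attribute value, which is enough to make the pattern cut the tag short.

```php
<?php
// Pattern from the question: matches opening and closing span tags.
$pattern = '/<(\/)?(span)[^>]*>/i';

// Well-behaved markup: both span tags are removed cleanly.
$clean = '<span class="test">Some <strong>text</strong></span>';
$cleanResult = preg_replace($pattern, '', $clean);

// A ">" inside an attribute value ends the match too early,
// so part of the attribute is left behind in the output.
$tricky = '<span title="a > b">x</span>';
$trickyResult = preg_replace($pattern, '', $tricky);

echo $cleanResult, PHP_EOL;  // Some <strong>text</strong>
echo $trickyResult, PHP_EOL; //  b">x
```

The leftover ` b">` in the second result is the kind of damage a DOM-based approach avoids, because the parser knows where the attribute value ends.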

But my interest here is in becoming a better PHP programmer. And since it is always possible to trip up a regex with badly formatted markup, I'm looking for a better way. I have also considered using strip_tags(), doing something like the following:

public function strip_tags( $content, $tags_to_strip = array() )
{
    // All valid XHTML tags
    $valid_tags = array(
        'a','abbr','acronym','address','area','b','base','bdo','big','blockquote','body','br','button','caption','cite',
        'code','col','colgroup','dd','del','dfn','div','dl','DOCTYPE','dt','em','fieldset','form','h1','h2','h3','h4',
        'h5','h6','head','html','hr','i','img','input','ins','kbd','label','legend','li','link','map','meta','noscript',
        'object','ol','optgroup','option','p','param','pre','q','samp','script','select','small','span','strong','style',
        'sub','sup','table','tbody','td','textarea','tfoot','th','thead','title','tr','tt','ul','var'
    );

    // Remove each tag to strip from the valid_tags array
    foreach ( $tags_to_strip as $tag ) {
        $ndx = array_search( $tag, $valid_tags );
        if ( $ndx !== false ) {
            unset( $valid_tags[ $ndx ] );
        }
    }

    // Convert the valid_tags array into the allowed-tags string for strip_tags()
    $valid_tags = implode( '><', $valid_tags );
    $valid_tags = "<$valid_tags>";

    $content = strip_tags( $content, $valid_tags );
    return $content;
}

But this is still parsing the string, not parsing the DOM, so if the markup is malformed it is possible to strip too much. Many people are quick to suggest Simple HTML DOM Parser, but looking at its source code, it seems to use regex to parse the HTML as well.

Can this be done with PHP5's DOM, or is there a better way to strip tags while leaving their contents intact? Would it be bad practice to use Tidy or HTML Purifier to clean the text and then run regex or Simple HTML DOM Parser on it?

Libraries like phpQuery seem too heavyweight for what should be a simple task.

Recommended answer

I use the following function to remove a node without removing its children:

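function DOMRemove(DOMNode $from) {
    // Move each child in front of $from, preserving order.
    // A while loop (rather than do/while) also copes with a node that has no children.
    while ($from->firstChild !== null) {
        $from->parentNode->insertBefore($from->firstChild, $from);
    }
    // $from is now empty and can be removed.
    $from->parentNode->removeChild($from);
}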

For example:

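$dom = new DOMDocument;
$dom->loadHTMLFile('myhtml.html'); // parse as HTML; load() expects well-formed XML

// getElementsByTagName() returns a live list, so copy it before changing the tree
$nodes = iterator_to_array($dom->getElementsByTagName('span'));
foreach ($nodes as $node) {
    DOMRemove($node);
}
echo $dom->saveHTML();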

will give you (apart from any document wrapper that saveHTML() adds):

Some text that is <strong>bolded</strong> and contains a <a href="#">link</a>.

While this:

// Copy the live node list first so removals do not skip elements
$nodes = iterator_to_array($dom->getElementsByTagName('a'));
foreach ($nodes as $node) {
    DOMRemove($node);
}
echo $dom->saveHTML();

will give you:

<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>
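To try the approach on a string rather than a file, here is a minimal self-contained sketch. The helper names unwrapNode() and stripTagKeepContents() are hypothetical, not part of PHP's DOM extension, and wrapping the input in a div is an assumption made here so that only the fragment is serialized instead of the html/body scaffolding that loadHTML() adds. It also assumes ASCII input; other encodings need extra handling when calling loadHTML().

```php
<?php
/** Replace $node with its own children, keeping their order. */
function unwrapNode(DOMNode $node): void
{
    while ($node->firstChild !== null) {
        $node->parentNode->insertBefore($node->firstChild, $node);
    }
    $node->parentNode->removeChild($node);
}

/** Hypothetical helper: strip every <$tagName> element from an HTML fragment. */
function stripTagKeepContents(string $html, string $tagName): string
{
    $dom = new DOMDocument();

    // Collect parser warnings instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    // Assumption: a wrapper div marks where the fragment ends up in the tree.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The first div in document order is the wrapper added above.
    $wrapper = $dom->getElementsByTagName('div')->item(0);

    // Copy the live node list before changing the tree.
    foreach (iterator_to_array($dom->getElementsByTagName($tagName)) as $node) {
        if (!$node->isSameNode($wrapper)) {
            unwrapNode($node);
        }
    }

    // Serialize only the wrapper's children, not the surrounding document.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = '<span class="test">Some text that is <strong>bolded</strong> and contains a <a href="#">link</a>.</span>';
echo stripTagKeepContents($html, 'span'), PHP_EOL;
```

Because the node list is copied up front, nested spans are handled too: the outer span is unwrapped first, and the inner one is still attached to the document when its turn comes.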
