Regex for selective stripping of HTML
Problem description
I'm trying to parse some HTML with PHP as an exercise, outputting it as plain text, and I've hit a snag. I'd like to remove any elements that are hidden with style="display: none;", bearing in mind that the tag may contain other attributes and other style properties.
The code I have so far is this:
$page = preg_replace("#<([a-z]+).*?style=\".*?display:\s*none[^>]*>.*?</\1>#s","",$page);
The code is returning NULL with a PREG_BACKTRACK_LIMIT_ERROR.
I tried this instead:
$page = preg_replace("#<([a-z]+)[^>]*?style=\"[^\"]*?display:\s*none[^>]*>.*?</\1>#s","",$page);
But now it's just not replacing any tags.
Any help would be much appreciated. Thanks!
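One likely reason the second pattern matches nothing: inside a double-quoted PHP string, "\1" is parsed as an octal escape (the byte 0x01) before PCRE ever sees it, so the pattern asks for a literal control character instead of a backreference. The sketch below compares the same pattern in double and single quotes; the sample markup and variable names are made up for illustration, and even the corrected pattern will still misfire on nested elements that share a tag name.

```php
<?php
// Hypothetical sample input, not taken from the question.
$html = '<p>Visible</p><div class="x" style="color: red; display: none;">Hidden</div>';

// Double quotes: "\1" becomes chr(1), so "</\1>" can never match a closing tag.
$doubleQuoted = "#<([a-z]+)[^>]*?style=\"[^\"]*?display:\s*none[^>]*>.*?</\1>#s";

// Single quotes: the backslash survives, so PCRE receives a real backreference.
$singleQuoted = '#<([a-z]+)[^>]*?style="[^"]*?display:\s*none[^>]*>.*?</\1>#s';

$withDouble = preg_replace($doubleQuoted, '', $html);
$withSingle = preg_replace($singleQuoted, '', $html);

var_dump("\1" === chr(1));      // the escape really is a control byte
var_dump($withDouble === $html); // nothing was replaced
var_dump($withSingle);           // the hidden <div> is gone
```

Writing the backreference as \\1 inside double quotes should have the same effect as switching to single quotes.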
Using DOMDocument, you can try something like this:
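$doc = new DOMDocument;
$doc->loadHTMLFile("foo.html");
// Copy the live node list into an array so that removals do not disturb the iteration
$nodes = iterator_to_array($doc->getElementsByTagName('*'));
foreach ($nodes as $node) {
    // Allow optional whitespace around the colon, in any letter case
    if (preg_match('/display\s*:\s*none/i', $node->getAttribute('style'))) {
        // A node has to be removed through its own parent, not through the document
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }
}
$doc->saveHTMLFile("foo.html");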
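Since the goal in the question is plain-text output from a string rather than a rewritten file, the same idea can be wrapped in a small function. This is a minimal sketch under a few assumptions: the function name stripHiddenElements is hypothetical, the input is a complete or partial HTML string, and only inline style attributes are checked (stylesheet rules are ignored). Note that loadHTML assumes ISO-8859-1 unless the markup declares a charset, so non-ASCII input may need extra care.

```php
<?php
// Hypothetical helper: drop elements hidden via inline "display: none"
// and return the remaining text content.
function stripHiddenElements(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Snapshot the live node list before removing anything.
    foreach (iterator_to_array($doc->getElementsByTagName('*')) as $node) {
        $hidden = preg_match('/display\s*:\s*none/i', $node->getAttribute('style'));
        if ($hidden && $node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // documentElement is null when nothing could be parsed.
    $root = $doc->documentElement;
    return $root === null ? '' : trim($root->textContent);
}

$text = stripHiddenElements(
    '<p>Shown</p><div style="color:red; display:none">Hidden <b>too</b></div>'
);
echo $text, PHP_EOL;
```

Removing a hidden parent also discards its descendants, so children of a hidden element do not need their own style check.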