Extract all the text and img tags from HTML in PHP
Possible Duplicate:
Best methods to parse HTML with PHP
For a project, I need to take an HTML page, extract all of its text, links and img tags, and keep them in the same order in which they appear on the page.
So for example, if the web page is:
<p>Hi</p>
<a href ="test.com" alt="a link"> text link</a>
<img src="test.png" />
<a href ="test.com"><img src="test2.png" /></a>
I would like to retrieve that information in this format:
text - Hi
Link1 - <a href ="test.com">text link</a> (note: without the alt or any other attribute)
Img1 - test.png
Link2 - <a href ="test.com"><img src="test2.png" /></a> (again, no extra attributes)
Is there a way to do that in PHP?
Yes, you can first strip all the tags you're not interested in and then use DOMDocument to remove all unwanted attributes. Finally, you need to re-run strip_tags to remove the tags added by DOMDocument:
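$allowed_tags = '<a><img>';
$allowed_attributes = array('href', 'src');

// Drop every tag except <a> and <img>; the text content is kept.
$html = strip_tags($html, $allowed_tags);

$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('*') as $node)
{
    // Copy the attribute list first: removing entries from the live
    // list while iterating over it can skip attributes.
    foreach (iterator_to_array($node->attributes) as $attribute)
    {
        if (in_array($attribute->name, $allowed_attributes)) continue;
        $node->removeAttributeNode($attribute);
    }
}

// Serialize only <body>, then strip the wrapper tags that DOMDocument added.
$html = $dom->saveHTML($dom->getElementsByTagName('body')->item(0));
$html = strip_tags($html, $allowed_tags);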
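The snippet above leaves you with cleaned HTML, but not yet with the ordered "text / Link / Img" list from the question. One possible way to get that list is to walk the DOM in document order, as in the minimal sketch below. The function names extract_items, walk_node, strip_attributes and format_items are made up for this illustration and are not part of any library, and the sketch assumes PHP 7.1 or later and input that is ASCII or otherwise readable by loadHTML without an encoding hint.

```php
<?php
// Hypothetical helpers for illustration only (not from the original answer).

// Remove every attribute except the allowed ones from $el and its descendants.
function strip_attributes(DOMElement $el, array $allowed): void
{
    $elements = [$el];
    foreach ($el->getElementsByTagName('*') as $descendant) {
        $elements[] = $descendant;
    }
    foreach ($elements as $element) {
        // Iterate over a copy, because the attribute list is live.
        foreach (iterator_to_array($element->attributes) as $attribute) {
            if (!in_array($attribute->name, $allowed, true)) {
                $element->removeAttributeNode($attribute);
            }
        }
    }
}

// Depth-first walk, so items are collected in page order.
function walk_node(DOMDocument $dom, DOMNode $node, array &$items): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $items[] = ['text', $text];
            }
        } elseif ($child instanceof DOMElement) {
            $tag = strtolower($child->nodeName);
            if ($tag === 'a') {
                // Keep the whole link, minus unwanted attributes.
                strip_attributes($child, ['href', 'src']);
                $items[] = ['link', $dom->saveHTML($child)];
            } elseif ($tag === 'img') {
                $items[] = ['img', $child->getAttribute('src')];
            } else {
                // Any other element (p, div, ...) is just a container here.
                walk_node($dom, $child, $items);
            }
        }
    }
}

// Returns a list of [type, value] pairs, type being 'text', 'link' or 'img'.
function extract_items(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $items = [];
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        walk_node($dom, $body, $items);
    }
    return $items;
}

// Turns the pairs into lines such as "text - Hi" or "Img1 - test.png".
function format_items(array $items): array
{
    $counts = ['link' => 0, 'img' => 0];
    $lines = [];
    foreach ($items as [$type, $value]) {
        if ($type === 'text') {
            $lines[] = "text - $value";
        } else {
            $counts[$type]++;
            $lines[] = ucfirst($type) . $counts[$type] . " - $value";
        }
    }
    return $lines;
}
```

You could then print the result with something like echo implode("\n", format_items(extract_items($html))); Note that saveHTML writes the img inside a link as <img src="test2.png"> without the trailing slash, and that images nested in a link are reported as part of that link rather than as a separate Img entry, which matches the Link2 line in the question.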