Extract all content (including HTML) from a div class using PHP
Question
Example HTML...
<html>
<head></head>
<body>
<table>
<tr>
<td class="rsheader"><b>Header Content</b></td>
</tr>
<tr>
<td class="rstext">Some text (Most likely will contain lots of HTML</td>
</tr>
</table>
</body>
</html>
I need to convert a page of HTML into a templated version of that page. The HTML page is made up of several boxes, each with a header (referred to in the above code as "rsheader") and some text (referred to in the above code as "rstext").
I'm trying to write a PHP script that retrieves the HTML page, maybe using file_get_contents, and then extracts whatever content is within the rsheader and rstext divs. Basically, I don't know how to! I've tried experimenting with DOM but I don't know it too well, and although I did manage to extract the text, it ignored any HTML.
My PHP...
<?php
$html = '<html>
<head></head>
<body>
<table>
<tr>
<td class="rsheader"><b>Header Content</b></td>
</tr>
<tr>
<td class="rstext">Some text (Most likely will contain lots of HTML</td>
</tr>
</table>
</body>
</html>';
$dom = new DomDocument();
$dom->loadHtml($html);
$xpath = new DomXpath($dom);
$div = $xpath->query('//*[@class="rsheader"]')->item(0);
echo $div->textContent;
?>
If I do a print_r($div) I see this...
DOMElement Object
(
[tagName] => td
[schemaTypeInfo] =>
[nodeName] => td
[nodeValue] => Header Content
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] =>
[nextSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] => td
[baseURI] =>
[textContent] => Header Content
)
As you can see, there are no HTML tags within the textContent node, which leads me to believe I'm going about it the wrong way :(
Really hoping someone might be able to give me some help with this...
Thanks in advance,
Paul
Recommended answer
XPath may be a bit more than this task needs. I would try DOMDocument's getElementById() method; the example below is adapted from this post.
NOTE: Updated to use tag and class names instead of element IDs.
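// Returns the markup of all child nodes of $node (its "inner HTML").
function getChildHtml( $node )
{
    $innerHtml = '';
    $children = $node->childNodes;
    foreach( $children as $child )
    {
        // Append each serialized child once; do not re-append $innerHtml itself.
        $innerHtml .= $child->ownerDocument->saveXML( $child );
    }
    return $innerHtml;
}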
$dom = new DomDocument();
$dom->loadHtml( $html );
// Gather all table cells in the document.
$cells = $dom->getElementsByTagName( 'td' );
// Loop through the collected table cells looking for those of class 'rsheader' or 'rstext'.
foreach( $cells as $cell )
{
    if( $cell->getAttribute( 'class' ) == 'rsheader' )
    {
        $headerHtml = getChildHtml( $cell );
        // Do something with header html.
    }
    if( $cell->getAttribute( 'class' ) == 'rstext' )
    {
        $textHtml = getChildHtml( $cell );
        // Do something with text html.
    }
}
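For comparison, here is a minimal sketch of the same idea that keeps the XPath query from the question and serializes each child with saveHTML() instead of saveXML(). The helper name innerHtml(), the shortened sample markup, and the $results array are assumptions made for illustration; they are not part of the original answer. The XPath expression also matches cells that carry more than one class, which the exact string comparison above does not.

```php
<?php
// Hypothetical helper (not from the original answer): returns the inner HTML of a node.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes only that node, as HTML.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Shortened sample markup, assumed for this sketch.
$html = '<table><tr><td class="rsheader"><b>Header Content</b></td></tr>'
      . '<tr><td class="rstext">Some <i>text</i></td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the inner HTML of every matching cell, grouped by class name.
$results = [];
foreach (['rsheader', 'rstext'] as $class) {
    // Match the class as a whole word inside the class attribute.
    $query = sprintf('//td[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);
    foreach ($xpath->query($query) as $cell) {
        $results[$class][] = innerHtml($cell);
    }
}

print_r($results);
```

If the real pages contain malformed markup, loadHTML() may emit warnings; calling libxml_use_internal_errors(true) beforehand is one common way to keep them out of the output.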