DOM and XPath query when parsing HTML


Question

Here is the HTML I would like to parse:

$html = '
<h1>title</h1>

<div id="main">

<div id="page">

<div class="article">
<h2><span>date1</span> <a href="link1">title1</a></h2>
<p>text1</p>
</div>

<div class="article">
<h2><span>date2</span> <a href="link2">title2</a></h2>
<p>text2</p>
</div>

</div>

</div>';

This is what I would like to get:

Array
(
[0] => Array
    (
        [link] => link1
        [title] => title1
        [description] => description1
        [date] => date1
    )

[1] => Array
    (
        [link] => link2
        [title] => title2
        [description] => description2
        [date] => date2
    )

)

Here is my PHP:

$doc = new DOMDocument(); $doc->loadHTML($html); $xpath = new DOMXpath($doc);
$nodes = $xpath->query("//div[@class='article']/h2/a");
$list = array(); $i = 0;

if($nodes)
{
foreach($nodes as $node) {

    if($node->getAttribute('href')) 
    { $link = $node->getAttribute('href'); $list[$i]['link'] = $link; }

    if($node->nodeValue) 
    { $title = $node->nodeValue; $list[$i]['title'] = $title; }

    if($node->nodeValue) 
    { $description = $node->nodeValue; $list[$i]['description'] = $description; }

    if($node->nodeValue) 
    { $date = $node->nodeValue; $list[$i]['date'] = $date; }

    $i++;
}
}

echo '<pre>';
echo print_r ($list);
echo '</pre>';

The result is OK for link1, title1, link2 and title2, but not for description1, date1, description2 and date2.

I looked for concrete cases close to mine in the PHP manual, but most of what it says about DOMDocument is quite theoretical. Could you please help me, or point me to some more concrete resources?

EDIT: here is the content of $node

DOMElement Object
(
[tagName] => a
[schemaTypeInfo] => 
[nodeName] => a
[nodeValue] => title1
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] => 
[prefix] => 
[localName] => a
[baseURI] => 
[textContent] => title1
)
1
DOMElement Object
(
[tagName] => a
[schemaTypeInfo] => 
[nodeName] => a
[nodeValue] => title2
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] => 
[prefix] => 
[localName] => a
[baseURI] => 
[textContent] => title2
)
1


Recommended answer

Normally I wouldn't work this way, but here is a solution to your problem. I fetch the article div instead of the anchor:

// Select each article container rather than the anchor inside it.
$aNodes = $xpath->query("//div[@class='article']");
$aList = array();
$i = 0;
if ($aNodes) {
    foreach ($aNodes as $aNode) {
        // The date is the text of the <span> inside the heading.
        $aDates = $aNode->getElementsByTagName('span');
        foreach ($aDates as $sDate) {
            $aList[$i]['date'] = $sDate->nodeValue;
        }
        // The link and its text both come from the <a> element.
        $aLinks = $aNode->getElementsByTagName('a');
        foreach ($aLinks as $sLink) {
            $aList[$i]['link'] = $sLink->getAttribute('href');
            $aList[$i]['linktext'] = $sLink->nodeValue;
        }
        // The description is the text of the <p> element.
        $aTexts = $aNode->getElementsByTagName('p');
        foreach ($aTexts as $sText) {
            $aList[$i]['descript'] = $sText->nodeValue;
        }
        $i++;
    }
}
echo '<pre>';
print_r ($aList);
echo '</pre>';

Or, if you are sure the layout is always the same:

foreach ($aNodes as $aNode) {
    $aList[$i]['date'] = $aNode->getElementsByTagName('span')->item(0)->nodeValue;
    $aList[$i]['link'] = $aNode->getElementsByTagName('a')->item(0)->getAttribute('href');
    $aList[$i]['linktext'] = $aNode->getElementsByTagName('a')->item(0)->nodeValue;
    $aList[$i]['descript'] = $aNode->getElementsByTagName('p')->item(0)->nodeValue;
    $i++;
}
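
As an alternative to the getElementsByTagName calls above, here is a minimal sketch that stays with XPath and resolves each path relative to the current article div. This is not part of the original answer. It assumes the markup from the question, and it uses the key names from the desired output (link, title, description, date) instead of linktext and descript. Note that with the sample HTML the description comes out as text1 and text2, because that is what the <p> elements contain.

```php
<?php
// Same markup as in the question.
$html = '
<h1>title</h1>
<div id="main">
<div id="page">
<div class="article">
<h2><span>date1</span> <a href="link1">title1</a></h2>
<p>text1</p>
</div>
<div class="article">
<h2><span>date2</span> <a href="link2">title2</a></h2>
<p>text2</p>
</div>
</div>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$list = array();
foreach ($xpath->query("//div[@class='article']") as $article) {
    // Passing $article as the second argument makes it the context node,
    // so paths such as "h2/a" are resolved inside this article only.
    // string(...) returns the text of the first matching node, or ''
    // when nothing matches.
    $list[] = array(
        'link'        => $xpath->evaluate('string(h2/a/@href)', $article),
        'title'       => $xpath->evaluate('string(h2/a)', $article),
        'description' => $xpath->evaluate('string(p)', $article),
        'date'        => $xpath->evaluate('string(h2/span)', $article),
    );
}

echo '<pre>';
print_r($list);
echo '</pre>';
```

The original query returned only the <a> elements, which is why nodeValue always gave the title. Starting from the article div and querying relative to it lets each field come from its own element.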
