DOM and xpath query in a parsing HTML case

Views: 81 | Published: 2020/10/25 20:28:39 | Tags: php, dom, xpath


Question

Here is the HTML I would like to parse:

$html = '
<h1>title</h1>

<div id="main">

<div id="page">

<div class="article">
<h2><span>date1</span> <a href="link1">title1</a></h2>
<p>text1</p>
</div>

<div class="article">
<h2><span>date2</span> <a href="link2">title2</a></h2>
<p>text2</p>
</div>

</div>

</div>';

This is what I would like to get:

Array
(
[0] => Array
    (
        [link] => link1
        [title] => title1
        [description] => description1
        [date] => date1
    )

[1] => Array
    (
        [link] => link2
        [title] => title2
        [description] => description2
        [date] => date2
    )

)

Here is my PHP:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$nodes = $xpath->query("//div[@class='article']/h2/a");
$list = array();
$i = 0;

if($nodes)
{
foreach($nodes as $node) {

    if($node->getAttribute('href')) 
    { $link = $node->getAttribute('href'); $list[$i]['link'] = $link; }

    if($node->nodeValue) 
    { $title = $node->nodeValue; $list[$i]['title'] = $title; }

    if($node->nodeValue) 
    { $description = $node->nodeValue; $list[$i]['description'] = $description; }

    if($node->nodeValue) 
    { $date = $node->nodeValue; $list[$i]['date'] = $date; }

    $i++;
}
}

echo '<pre>';
echo print_r ($list);
echo '</pre>';

The result is OK for link1, title1, link2, title2, but not for description1, date1, description2, date2.

I looked for specific cases close to mine in the PHP manual, but most of the time everything is quite theoretical when it comes to DOMDocument. Could you please help me, or suggest some more concrete resources?

EDIT: here is the content of $node

DOMElement Object
(
[tagName] => a
[schemaTypeInfo] => 
[nodeName] => a
[nodeValue] => title1
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] => 
[prefix] => 
[localName] => a
[baseURI] => 
[textContent] => title1
)
1
DOMElement Object
(
[tagName] => a
[schemaTypeInfo] => 
[nodeName] => a
[nodeValue] => title2
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] => 
[prefix] => 
[localName] => a
[baseURI] => 
[textContent] => title2
)
1

Recommended answer

Normally I wouldn't work this way, but here is a solution for your problem. I'm fetching the article div instead of the anchor:

$aNodes = $xpath->query("//div[@class='article']");
$aList = array(); 
$i = 0;
if($aNodes){
    foreach($aNodes as $aNode) {
        $aDates = $aNode->getElementsByTagName('span');
        foreach ($aDates as $sDate){
            $aList[$i]['date'] = $sDate->nodeValue;
        }
        $aLinks = $aNode->getElementsByTagName('a');
        foreach ($aLinks as $sLink){
            $aList[$i]['link']  = $sLink->getAttribute('href');
            $aList[$i]['linktext'] = $sLink->nodeValue;
        }
        $aTexts = $aNode->getElementsByTagName('p');
        foreach ($aTexts as $sText){
            $aList[$i]['descript'] = $sText->nodeValue;
        }
        $i++;
    }
}
echo '<pre>';
print_r ($aList);
echo '</pre>';

Or, if you are sure the layout is always the same:

foreach($aNodes as $aNode) {
    $aList[$i]['date'] = $aNode->getElementsByTagName('span')->item(0)->nodeValue;
    $aList[$i]['link'] = $aNode->getElementsByTagName('a')->item(0)->getAttribute('href');
    $aList[$i]['linktext'] = $aNode->getElementsByTagName('a')->item(0)->nodeValue;
    $aList[$i]['descript'] = $aNode->getElementsByTagName('p')->item(0)->nodeValue;
    $i++;
}
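
Another variant, sketched here on the assumption that the markup matches the sample in the question: DOMXPath::query() and DOMXPath::evaluate() both accept a context node as their second argument, so each field can be read with a path relative to the article div. The key names follow the array the question asked for. Note that the sample paragraphs contain text1 and text2, so those are the values that end up under description. The block repeats the question's $html only so that it can run on its own.

```php
<?php
// Sample markup copied from the question.
$html = '
<h1>title</h1>
<div id="main">
<div id="page">
<div class="article">
<h2><span>date1</span> <a href="link1">title1</a></h2>
<p>text1</p>
</div>
<div class="article">
<h2><span>date2</span> <a href="link2">title2</a></h2>
<p>text2</p>
</div>
</div>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$list = array();

// Select each article container, then read its fields with paths
// that are relative to that container (the second argument).
foreach ($xpath->query("//div[@class='article']") as $article) {
    $list[] = array(
        // string(...) evaluates to '' when the path matches nothing.
        'link'        => $xpath->evaluate('string(h2/a/@href)', $article),
        'title'       => $xpath->evaluate('string(h2/a)', $article),
        'description' => $xpath->evaluate('string(p)', $article),
        'date'        => $xpath->evaluate('string(h2/span)', $article),
    );
}

print_r($list);
```

Because string() evaluates to an empty string when a path matches nothing, an article without a span or p gets '' for that key, whereas item(0) returns null in that case and the property or method access on it fails.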
