How to parse actual HTML from page using CURL?


Problem description



I am "attempting" to scrape a web page that contains the following structure:

<p class="row">
    <span>stuff here</span>
    <a href="http://www.host.tld/file.html">Descriptive Link Text</a>
    <div>Link Description Here</div>
</p>

I am scraping the webpage using curl:

<?php
    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($handle);
    curl_close($handle);
?>

I have done some research and found that I should not use a RegEx to parse the HTML that is returned from the curl, and that I should use PHP DOM. This is how I have done this:

$newDom = new DOMDocument();
// preserveWhiteSpace must be set before loadHTML() to take effect.
$newDom->preserveWhiteSpace = false;
$newDom->loadHTML($html);
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++){
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}

Now, I am not pretending to completely understand this, but I get the gist, and I do get the sections I want. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:

for($i=0; $i<$nodeNo; $i++){
    $printString = $sections->item($i)->nodeValue;
    echo "<a href=\"<extracted link>\">LINK</a> " . $printString . "<br>";
}

As you can see, I cannot get the link because I am only getting the text of the webpage and not the source, like I want. I know the "curl_exec" is pulling the HTML because I have tried just that, so I believe that the DOM is somehow stripping the HTML that I want.

Solution

According to the comments on the PHP manual pages for DOM, you should use the following inside your loop:

    $tmp_dom = new DOMDocument();
    // Deep-copy the node (and all of its descendants) into a scratch document, then serialize it.
    $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
    $innerHTML = trim($tmp_dom->saveHTML());

This will set $innerHTML to the serialized HTML of the node, i.e. the node's own tag along with everything inside it.
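As an aside, on PHP 5.3.6 and later, DOMDocument::saveHTML() accepts an optional node argument, so the scratch document is unnecessary. A minimal sketch (the $html literal here is just a stand-in for the page fetched with curl):

```php
<?php
// Stand-in for the fetched page; replace with the curl output from the question.
$html = '<p class="row"><span>stuff here</span>'
      . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('p') as $p) {
    // Outer HTML of the node in a single call (PHP >= 5.3.6).
    $outerHTML = $dom->saveHTML($p);
    // Inner HTML: serialize each child node instead.
    $innerHTML = '';
    foreach ($p->childNodes as $child) {
        $innerHTML .= $dom->saveHTML($child);
    }
    echo $innerHTML, "\n";
}
?>
```

Passing the node keeps the markup, including the child tags, rather than collapsing it to text the way nodeValue does.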

But I think what you really want is to get the 'a' nodes under the 'p' node, so do this:

$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++) {
    $sec = $sections->item($i);
    $links = $sec->getElementsByTagName('a');
    $linkNo = $links->length;
    for ($j=0; $j<$linkNo; $j++) {
        $printString = $links->item($j)->nodeValue;
        echo $printString . "<br>";
    }
}

This will print just the text content (nodeValue) of each link.
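To get at the URLs themselves rather than just the link text, DOMElement::getAttribute('href') does the job. A minimal sketch, assuming the same &lt;p class="row"&gt; markup from the question (the $html literal is a stand-in for the curl result):

```php
<?php
// Stand-in for the fetched page; in practice $html comes from curl_exec().
$html = '<p class="row"><span>stuff here</span>'
      . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a>'
      . '<div>Link Description Here</div></p>';

$newDom = new DOMDocument();
// Suppress warnings from real-world, not-quite-valid HTML.
libxml_use_internal_errors(true);
$newDom->loadHTML($html);
libxml_clear_errors();

foreach ($newDom->getElementsByTagName('p') as $sec) {
    foreach ($sec->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href'); // the URL
        $text = $link->nodeValue;            // the link text
        echo '<a href="' . htmlspecialchars($href) . '">' . $text . "</a><br>\n";
    }
}
?>
```

This produces the `<a href="...">` output the question was after, with both the URL and the text pulled from the DOM.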
