How to parse actual HTML from page using CURL?
Question
I am "attempting" to scrape a web page that has the following structures within the page:
<p class="row">
<span>stuff here</span>
<a href="http://www.host.tld/file.html">Descriptive Link Text</a>
<div>Link Description Here</div>
</p>
I am scraping the webpage using curl:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>
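One thing worth adding to the snippet above: `curl_exec()` returns `false` on failure, so it is safer to check before handing the result to a parser. A minimal sketch (the URL is the question's placeholder host and will not actually resolve; the timeout and redirect options are my additions, not part of the original):

```php
<?php
// Same fetch as above, with basic error handling added.
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
curl_setopt($handle, CURLOPT_TIMEOUT, 5);            // don't hang on a dead host
$html = curl_exec($handle);
if ($html === false) {
    // curl_error() describes what went wrong (DNS failure, timeout, ...)
    echo 'curl failed: ' . curl_error($handle);
    $html = ''; // fall back to an empty page
}
curl_close($handle);
?>
```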
I have done some research and found that I should not use a regex to parse the HTML returned by curl, and that I should use PHP DOM instead. This is how I have done it:
$newDom = new DOMDocument();
$newDom->preserveWhiteSpace = false; // must be set before loadHTML() to have any effect
$newDom->loadHTML($html);
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++){
$printString = $sections->item($i)->nodeValue;
echo $printString . "<br>";
}
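One pitfall worth flagging at this point: real-world pages rarely validate, and `loadHTML()` emits a PHP warning for every malformed construct it encounters (the `<div>` inside `<p>` in the markup above is one such case, since the parser closes the `<p>` early). `libxml_use_internal_errors()` collects those warnings instead of printing them. A self-contained sketch using the question's sample markup inline in place of the fetched page:

```php
<?php
// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);

// The question's sample row, standing in for the fetched $html.
$html = '<p class="row"><span>stuff here</span>'
      . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a>'
      . '<div>Link Description Here</div></p>';

$newDom = new DOMDocument();
$newDom->preserveWhiteSpace = false;
$newDom->loadHTML($html);
libxml_clear_errors(); // discard the collected parse warnings

$sections = $newDom->getElementsByTagName('p');
echo $sections->item(0)->nodeValue; // text content only, markup stripped
?>
```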
Now I am not pretending that I completely understand this but I get the gist, and I do get the sections I am wanting. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML because I want to extract the links and use them too, like so:
for($i=0; $i<$nodeNo; $i++){
$printString = $sections->item($i)->nodeValue;
echo "<a href=\"<extracted link>\">LINK</a> " . $printString . "<br>";
}
As you can see, I cannot get the link because I am only getting the text of the webpage and not the source, like I want. I know the "curl_exec" is pulling the HTML because I have tried just that, so I believe that the DOM is somehow stripping the HTML that I want.
Answer

According to comments on the PHP manual on DOM, you should use the following inside your loop:
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());
This will set $innerHTML to be the HTML content of the node.
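As of PHP 5.3.6, `DOMDocument::saveHTML()` accepts an optional node argument, which makes the temporary document unnecessary. A self-contained sketch of that shortcut, using a stripped-down version of the question's markup in place of the fetched page:

```php
<?php
// Sketch assuming PHP >= 5.3.6, where saveHTML() can serialize a single node.
libxml_use_internal_errors(true);
$newDom = new DOMDocument();
$newDom->loadHTML('<p class="row">'
    . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a></p>');

$sections = $newDom->getElementsByTagName('p');
for ($i = 0; $i < $sections->length; $i++) {
    // Serializes the whole <p> element, markup included (its "outer HTML")
    $innerHTML = trim($newDom->saveHTML($sections->item($i)));
    echo $innerHTML . "<br>";
}
?>
```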
But I think what you really want is to get the 'a' nodes under the 'p' node, so do this:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++) {
$sec = $sections->item($i);
$links = $sec->getElementsByTagName('a');
$linkNo = $links->length;
for ($j=0; $j<$linkNo; $j++) {
$printString = $links->item($j)->nodeValue;
echo $printString . "<br>";
}
}
This will just print the body of each link.
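To get the links themselves rather than just their text, read the `href` attribute with `getAttribute()`. A sketch combining the pieces into roughly the output the question asked for, with the sample markup inlined in place of the fetched page (the `htmlspecialchars()` escaping is my addition):

```php
<?php
// For each <p class="row">, print the first link's URL plus the row text.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<p class="row"><span>stuff here</span>'
    . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a></p>');

foreach ($dom->getElementsByTagName('p') as $sec) {
    $links = $sec->getElementsByTagName('a');
    if ($links->length === 0) {
        continue; // paragraph without a link
    }
    $href = $links->item(0)->getAttribute('href'); // the URL the question wanted
    echo '<a href="' . htmlspecialchars($href) . '">LINK</a> '
        . $sec->nodeValue . "<br>";
}
?>
```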