How to parse actual HTML from page using CURL?
Question
I am "attempting" to scrape a web page that has the following structures within the page:
<p class="row">
<span>stuff here</span>
<a href="http://www.host.tld/file.html">Descriptive Link Text</a>
<div>Link Description Here</div>
</p>
I am scraping the webpage using curl:
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($handle);
curl_close($handle);
?>
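One thing worth adding to the snippet above: `curl_exec()` returns `false` on failure, so it is safer to check before handing the result to a parser. A minimal sketch (the URL is the question's placeholder host and will not actually resolve; the timeout and redirect options are my additions, not part of the original):

```php
<?php
// Same fetch as above, with basic error handling added.
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
curl_setopt($handle, CURLOPT_TIMEOUT, 5);            // don't hang on a dead host
$html = curl_exec($handle);
if ($html === false) {
    // curl_error() describes what went wrong (DNS failure, timeout, ...)
    echo 'curl failed: ' . curl_error($handle);
    $html = ''; // fall back to an empty page
}
curl_close($handle);
?>
```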
I have done some research and found that I should not use a regex to parse the HTML returned by curl, and that I should use PHP DOM instead. This is how I have done it:
$newDom = new DOMDocument();
$newDom->preserveWhiteSpace = false; // must be set before loadHTML() to have any effect
$newDom->loadHTML($html);
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++){
$printString = $sections->item($i)->nodeValue;
echo $printString . "<br>";
}
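One pitfall worth flagging at this point: real-world pages rarely validate, and `loadHTML()` emits a PHP warning for every malformed construct it encounters (the `<div>` inside `<p>` in the markup above is one such case, since the parser closes the `<p>` early). `libxml_use_internal_errors()` collects those warnings instead of printing them. A self-contained sketch using the question's sample markup inline in place of the fetched page:

```php
<?php
// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);

// The question's sample row, standing in for the fetched $html.
$html = '<p class="row"><span>stuff here</span>'
      . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a>'
      . '<div>Link Description Here</div></p>';

$newDom = new DOMDocument();
$newDom->preserveWhiteSpace = false;
$newDom->loadHTML($html);
libxml_clear_errors(); // discard the collected parse warnings

$sections = $newDom->getElementsByTagName('p');
echo $sections->item(0)->nodeValue; // text content only, markup stripped
?>
```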
Now I am not pretending that I completely understand this but I get the gist, and I do get the sections I am wanting. The only issue is that what I get is only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML because I want to extract the links and use them too, like so:
for($i=0; $i<$nodeNo; $i++){
$printString = $sections->item($i)->nodeValue;
echo "<a href=\"<extracted link>\">LINK</a> " . $printString . "<br>";
}
As you can see, I cannot get the link because I am only getting the text of the webpage and not the source, like I want. I know the "curl_exec" is pulling the HTML because I have tried just that, so I believe that the DOM is somehow stripping the HTML that I want.
Answer

According to comments on the PHP manual on DOM, you should use the following inside your loop:
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
$innerHTML = trim($tmp_dom->saveHTML());
This will set $innerHTML to be the HTML content of the node.
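As of PHP 5.3.6, `DOMDocument::saveHTML()` accepts an optional node argument, which makes the temporary document unnecessary. A self-contained sketch of that shortcut, using a stripped-down version of the question's markup in place of the fetched page:

```php
<?php
// Sketch assuming PHP >= 5.3.6, where saveHTML() can serialize a single node.
libxml_use_internal_errors(true);
$newDom = new DOMDocument();
$newDom->loadHTML('<p class="row">'
    . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a></p>');

$sections = $newDom->getElementsByTagName('p');
for ($i = 0; $i < $sections->length; $i++) {
    // Serializes the whole <p> element, markup included (its "outer HTML")
    $innerHTML = trim($newDom->saveHTML($sections->item($i)));
    echo $innerHTML . "<br>";
}
?>
```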
But I think what you really want is to get the 'a' nodes under the 'p' node, so do this:
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++) {
$sec = $sections->item($i);
$links = $sec->getElementsByTagName('a');
$linkNo = $links->length;
for ($j=0; $j<$linkNo; $j++) {
$printString = $links->item($j)->nodeValue;
echo $printString . "<br>";
}
}
This will just print the body of each link.
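To get the links themselves rather than just their text, read the `href` attribute with `getAttribute()`. A sketch combining the pieces into roughly the output the question asked for, with the sample markup inlined in place of the fetched page (the `htmlspecialchars()` escaping is my addition):

```php
<?php
// For each <p class="row">, print the first link's URL plus the row text.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<p class="row"><span>stuff here</span>'
    . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a></p>');

foreach ($dom->getElementsByTagName('p') as $sec) {
    $links = $sec->getElementsByTagName('a');
    if ($links->length === 0) {
        continue; // paragraph without a link
    }
    $href = $links->item(0)->getAttribute('href'); // the URL the question wanted
    echo '<a href="' . htmlspecialchars($href) . '">LINK</a> '
        . $sec->nodeValue . "<br>";
}
?>
```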