How to extract HTML comments and all HTML contained by a node?

Question

I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty xpath answer, but if I have to resort to regular expressions I guess that's ok. Although I'm not so great with regex so if you think that's the way to go, I'd appreciate examples...

Pretty standard starting point:

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

$html = curl_exec($ch);
// curl_exec() returns false on failure; an empty page is not an error.
if ($html === false) {
    $info .= "<br />cURL error number:" . curl_errno($ch);
    $info .= "<br />cURL error:" . curl_error($ch);
    return $info;
}

$dom = new DOMDocument();
// @ suppresses the warnings that loadHTML() emits for malformed markup.
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

and extraction of information, for example:

// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes ('.$iframes->length.'):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
    // get iframe attributes
    $iframe = $iframes->item($i);
    $framesrc = $iframe->getAttribute("src");
    $framewidth = $iframe->getAttribute("width");
    $frameheight = $iframe->getAttribute("height");
    $framealt = $iframe->getAttribute("alt");
    $frameclass = $iframe->getAttribute("class");
    $info .= $framesrc.'&nbsp;('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
}

Problems/questions:

  1. How to extract HTML comments?

I can't figure out how to identify the comments: are they considered nodes, or something else entirely?

  2. How to get the entire content of a div, including child nodes?

So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.

Recommended answer

Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:

$comments = $xpath->query('//comment()'); // or another path, as you prefer

They are standard nodes: see the PHP manual entry for the DOMComment class.
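
As a rough illustration of the comment() test, here is a minimal sketch that collects the text of every comment in a document. The sample markup and the $comments array are made up for the example; in the question's setup, $html would be the cURL response instead.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><!-- first --><div id="main"><p>Hi</p><!-- second --></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$comments = [];
foreach ($xpath->query('//comment()') as $comment) {
    // Each match is a DOMComment; nodeValue is the text between <!-- and -->.
    $comments[] = trim($comment->nodeValue);
}

print_r($comments);
```

To limit the search to one part of the page, the same test can be appended to a narrower path, for example '//div[@id="main"]//comment()'.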

To your other question, it's a bit trickier. The simplest way is to use saveXML() with its optional $node argument:

// $el should be the element you want to get the HTML for
$html = $dom->saveXML($el);
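
A minimal sketch of how this might look end to end, assuming a div with the hypothetical id "main" (the markup and variable names are illustrative only). saveXML($el) returns the element together with its own tags; if only the contents are wanted, one option is to serialize each child node and concatenate the results.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><div id="main"><img src="a.png" alt="A">'
      . '<a href="/one">One</a> and <a href="/two">Two</a></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$el = $xpath->query('//div[@id="main"]')->item(0);

// Outer markup: the div itself plus everything inside it.
$outer = $dom->saveXML($el);

// Inner markup only: serialize each child node and join the pieces.
$inner = '';
foreach ($el->childNodes as $child) {
    $inner .= $dom->saveXML($child);
}

echo $outer, "\n", $inner, "\n";
```

Note that saveXML() produces XML-style output, so empty elements such as img come back self-closed. Since PHP 5.3.6, saveHTML() also accepts an optional node argument and can be swapped in when HTML-style serialization is preferred.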
