How to extract HTML comments and all HTML contained by a node?


Question

I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty XPath answer, but if I have to resort to regular expressions I guess that's OK. I'm not so great with regex, though, so if you think that's the way to go, I'd appreciate examples...

Pretty standard starting point:

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL,$target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $html = curl_exec($ch);
    if ($html === false) { // curl_exec() returns false on failure; an empty body is not an error
        $info .= "<br />cURL error number:" .curl_errno($ch);
        $info .= "<br />cURL error:" . curl_error($ch);
        return $info;
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);

And extracting info, for example:

    // iframes
    $iframes = $xpath->evaluate("/html/body//iframe");
    $info .= '<h3>iframes ('.$iframes->length.'):</h3>';
    for ($i = 0; $i < $iframes->length; $i++) {
        // get iframe attributes
        $iframe = $iframes->item($i);
        $framesrc = $iframe->getAttribute("src");
        $framewidth = $iframe->getAttribute("width");
        $frameheight = $iframe->getAttribute("height");
        $framealt = $iframe->getAttribute("alt");
        $frameclass = $iframe->getAttribute("class");
        $info .= $framesrc.'&nbsp;('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
    }

Problems/Questions:


  1. How to extract HTML comments?

I can't figure out how to identify the comments – are they considered nodes, or something else entirely?

  2. How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.


Recommended answer

Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:

$comments = $xpath->query('//comment()'); // or another path, as you prefer

They are standard nodes: see the PHP manual entry for the DOMComment class.
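
As a rough illustration of the comment() test, the sketch below loads a small HTML string and collects the text of every comment. The sample markup and the $comments array are made up for this example, and libxml_use_internal_errors() is used here in place of the @ operator from the question; treat it as one possible approach rather than the only one.

```php
<?php
// Sample markup is invented for illustration only.
$html = '<html><body><!-- first note --><div id="main"><p>Hi</p><!-- second note --></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// comment() matches comment nodes, just as text() matches text nodes.
$comments = [];
foreach ($xpath->query('//comment()') as $node) {
    // Each $node is a DOMComment; nodeValue holds the text between <!-- and -->.
    $comments[] = trim($node->nodeValue);
}

print_r($comments);
```

To limit the search to one part of the page, narrow the path, for example '//div[@id="main"]//comment()'.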

To your other question, it's a bit trickier. The simplest way is to use saveXML() with its optional $node argument:

$html = $dom->saveXML($el);  // $el should be the element you want to get 
                             // the HTML for
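
A slightly fuller sketch of the same idea follows, assuming a div with the hypothetical id "box". It shows saveXML() on the element, and, as an alternative not mentioned above, saveHTML() with a node argument (available since PHP 5.3.6), which serializes as HTML rather than XML. Looping over childNodes is one way to get only the inner markup without the wrapping div.

```php
<?php
// Sample markup and the id "box" are invented for illustration only.
$html = '<html><body><div id="box"><img src="a.png" alt="A"><a href="/one">One</a> <a href="/two">Two</a></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$el = $xpath->query('//div[@id="box"]')->item(0);

// Outer markup: the div itself plus everything inside it.
$outerXml  = $dom->saveXML($el);    // XML serialization of the node
$outerHtml = $dom->saveHTML($el);   // HTML serialization; node argument needs PHP >= 5.3.6

// Inner markup only: serialize each child node and join the pieces.
$innerHtml = '';
foreach ($el->childNodes as $child) {
    $innerHtml .= $dom->saveHTML($child);
}

echo $outerHtml, "\n", $innerHtml, "\n";
```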
