DomDocument class unable to access DOMNode

Question

I cannot parse this URL: http://foldmunka.net

$ch = curl_init("http://foldmunka.net");

//curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); //not necessary unless the file redirects (like the PHP example we're using here)
$data = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
clearstatcache();
if ($data === false) {
  echo 'cURL failed';
  exit;
}
$dom = new DOMDocument();
$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8");
$data = preg_replace('/<\!\-\-\[if(.*)\]>/', '', $data);
$data = str_replace('<![endif]-->', '', $data);
$data = str_replace('<!--', '', $data);
$data = str_replace('-->', '', $data);
$data = preg_replace('@<script[^>]*?>.*?</script>@si', '', $data);
$data = preg_replace('@<style[^>]*?>.*?</style>@si', '', $data);

$data = mb_convert_encoding($data, 'HTML-ENTITIES', "utf-8");
@$dom->loadHTML($data);

$els = $dom->getElementsByTagName('*');
foreach($els as $el){
  print $el->nodeName." | ".$el->getAttribute('content')."<hr />";
  if($el->getAttribute('title'))$el->nodeValue = $el->getAttribute('title')." ".$el->nodeValue;
  if($el->getAttribute('alt'))$el->nodeValue = $el->getAttribute('alt')." ".$el->nodeValue;
  print $el->nodeName." | ".$el->nodeValue."<hr />";
}

I need the alt and title attributes and the plain text, in document order, but on this page I cannot access the nodes inside the body tag.

Recommended answer

Here is a solution using DOMDocument and DOMXPath. It is much shorter than the other solution, which uses Simple HTML DOM Parser, and it runs much faster (roughly 100 ms versus 2300 ms).

<?php

function makePlainText($source)
{
    $dom = new DOMDocument();
    $dom->loadHTMLFile($source);

    // To load from a string instead of a URL or file, use loadHTML():
    //$dom->loadHTML('<html><title>Hello</title><body>Hello this site<img src="asdasd.jpg" alt="alt attr" title="title attr"><a href="open.php" alt="alt attr" title="title attr">click</a> Some text.</body></html>');

    $xpath = new DOMXPath($dom);

    $plain = '';

    foreach ($xpath->query('//text()|//a|//img') as $node)
    {
        if ($node->nodeName == '#cdata-section')
            continue;

        if ($node instanceof DOMElement)
        {
            if ($node->hasAttribute('alt'))
                $plain .= $node->getAttribute('alt') . ' ';
            if ($node->hasAttribute('title'))
                $plain .= $node->getAttribute('title') . ' ';
        }
        if ($node instanceof DOMText)
            $plain .= $node->textContent . ' ';
    }

    return $plain;
}

echo makePlainText('http://foldmunka.net');
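
To try the same traversal without a network request, the variant below reads from an inline string. It is only a sketch: the helper name plainTextFromHtml() and the sample markup are made up for illustration and are not part of the original answer. It assumes that collecting libxml warnings with libxml_use_internal_errors() is acceptable in place of the @ operator used in the question, and it trims whitespace-only text nodes, which the answer above does not do.

```php
<?php

// Hypothetical helper (not from the original answer): the same XPath
// traversal as makePlainText(), but reading HTML from a string.
function plainTextFromHtml(string $html): string
{
    // Collect parser warnings internally instead of printing them,
    // so malformed real-world markup does not flood the output.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $parts = [];

    // A union query returns its nodes in document order, so attribute
    // values and text are collected in the order they appear on the page.
    foreach ($xpath->query('//text()|//a|//img') as $node) {
        // Script and style contents are typically exposed as CDATA
        // sections; skip them. DOMCdataSection extends DOMText, so this
        // check has to come before the DOMText branch.
        if ($node instanceof DOMCdataSection) {
            continue;
        }

        if ($node instanceof DOMElement) {
            foreach (['alt', 'title'] as $name) {
                if ($node->hasAttribute($name)) {
                    $parts[] = $node->getAttribute($name);
                }
            }
        } elseif ($node instanceof DOMText) {
            // Assumption: whitespace-only text nodes are not wanted.
            $text = trim($node->textContent);
            if ($text !== '') {
                $parts[] = $text;
            }
        }
    }

    return implode(' ', $parts);
}

// Made-up sample markup, for illustration only.
$sample = '<html><head><title>Hello</title></head><body>Hello this site'
    . '<img src="asdasd.jpg" alt="alt attr" title="title attr">'
    . '<a href="open.php" title="link title">click</a> Some text.</body></html>';

echo plainTextFromHtml($sample), PHP_EOL;
```

Because //text() matches every text node in the document, the contents of the title element are included as well. Changing the query to //body//text()|//body//a|//body//img should restrict the result to the page body.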
