php proDOM parsing error


Question


I am using the following code to parse a DOM document, but at the end I get this error: "google.ac" is null or not an object (line 402, char 1).

My guess is that line 402 contains a tag and a lot of ";" characters. How can I fix this?

<?php

//$ch = curl_init("http://images.google.com/images?q=books&tbm=isch/");


// create a new cURL resource
$ch = curl_init();

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);

// grab URL and pass it to the browser
$data = curl_exec($ch);

curl_close($ch); 

$dom = new DOMDocument();
$dom->loadHTML($data);
//@$dom->saveHTMLFile('newfolder/abc.html')

$dom->loadHTML('$data');

// find all ul
$list = $dom->getElementsByTagName('ul');

// get a few list items
$rows = $list->item(30)->getElementsByTagName('li');

// get anchors from the table
$links = $list->item(30)->getElementsByTagName('a');

foreach ($links as $link) {
    echo "<fieldset>";
    $links = $link->getElementsByAttribute('imgurl');

    $dom->saveXML($links);
}
?>

Solution

There are a few issues with the code:

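  1. You should set the cURL option CURLOPT_RETURNTRANSFER in order to capture the output, like this: curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);. By default, curl_exec() prints the response directly to the output (the browser) and returns only TRUE or FALSE, so in the code above $data never holds the HTML (http://www.php.net/manual/en/function.curl-exec.php)

  2. $dom->loadHTML('$data'); is incorrect and not required. Inside single quotes PHP does not interpolate variables, so this call parses the literal text $data rather than the fetched HTML, and it replaces the document loaded by the previous call.

  3. The way the 'li' and 'a' tags are read is fragile, because $list->item(30) always points to the element at index 30. The index is zero-based, so that is the 31st <ul>, and item() returns null when the page has fewer lists than that.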
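
Points 2 and 3 can be checked offline with a minimal sketch. The sample markup and the variable names below are assumptions made up for illustration; they stand in for the HTML that cURL would return.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$data = '<html><body><ul><li>one</li></ul><ul><li>two</li></ul></body></html>';

// Single quotes: no variable interpolation, so this is the literal text "$data".
$literal = '$data';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($data);            // pass the variable itself, not a quoted string
libxml_clear_errors();

$lists = $dom->getElementsByTagName('ul');

// item() is zero-based and returns null when the index does not exist.
$first   = $lists->item(0);
$missing = $lists->item(30);

echo $literal, "\n";              // prints: $data
echo $lists->length, "\n";        // prints: 2
echo $first->textContent, "\n";   // prints: one
var_dump($missing === null);      // prints: bool(true)
```

Because item(30) returns null here, chaining ->getElementsByTagName('li') onto it, as the question's code does, would fail with an error on a page that has fewer than 31 lists.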

Now for the fixes. I'm not sure whether you checked the HTML returned by the cURL request, but it seems different from what we discussed in the original post. In other words, the HTML returned by cURL does not contain the required <ul> and <li> elements. It contains <td> and <a> elements instead.

Add-on: I'm not sure why the HTML for the same page differs between the browser and PHP, but here is an explanation that I think might fit. The page uses JavaScript that renders some of the HTML dynamically on page load. A browser runs that JavaScript, so the dynamic HTML is visible there; PHP only receives the raw response and never executes the script. Hence, I assume the <ul> and <li> tags are generated dynamically. That isn't our concern for now, though.

Therefore, you should modify your code to parse the <a> elements and then read the image URLs. This code snippet might help:

<?php
$ch = curl_init(); // create a new cURL resource

// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

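$data = curl_exec($ch); // fetch the URL; returns the HTML as a string, or FALSE on failure
curl_close($ch);

if ($data === false) { // stop here if the request failed
    die('cURL request failed');
}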

$dom = new DOMDocument();
@$dom->loadHTML($data); // @ suppresses the warnings raised for malformed HTML

$listA = $dom->getElementsByTagName('a'); // read all <a> elements
foreach ($listA as $itemA) { // loop through each <a> element
    if ($itemA->hasAttribute('href')) { // check if it has an 'href' attribute
        $href = $itemA->getAttribute('href'); // read the value of 'href'
        if (preg_match('/^\/imgres\?/', $href)) { // check that 'href' should begin with "/imgres?"
            $qryString = substr($href, strpos($href, '?') + 1);
            parse_str($qryString, $arrHref); // read the query parameters from 'href' URI
            if (isset($arrHref['imgurl'])) { // skip links without an 'imgurl' parameter
                echo '<br>' . $arrHref['imgurl'] . '<br>';
            }
        }
    }
}

I hope the above makes sense. Please note that this parsing may break if Google changes its HTML.
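
Since the live markup can change, it may help to try the extraction step against a fixed string first. The following is a minimal sketch under stated assumptions: the function name extractImgUrls and the sample markup are hypothetical, and libxml_use_internal_errors() is swapped in for the @ operator to keep parser warnings quiet.

```php
<?php
/**
 * Collect the "imgurl" query parameter from every <a href="/imgres?..."> link.
 * The function name and the sample markup below are illustrative assumptions.
 */
function extractImgUrls(string $html): array
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    $urls = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');    // empty string when the attribute is absent
        if (strpos($href, '/imgres?') !== 0) {
            continue;                             // not an image-result link
        }
        // Parse everything after "/imgres?" into an array of query parameters.
        parse_str(substr($href, strlen('/imgres?')), $params);
        if (isset($params['imgurl'])) {
            $urls[] = $params['imgurl'];
        }
    }
    return $urls;
}

// Hypothetical markup: one matching link, one unrelated link, one without imgurl.
$sample = '<html><body>'
    . '<a href="/imgres?imgurl=http%3A%2F%2Fexample.com%2Fa.jpg&amp;h=100">a</a>'
    . '<a href="/search?q=books">other</a>'
    . '<a href="/imgres?h=50">no imgurl</a>'
    . '</body></html>';

print_r(extractImgUrls($sample));
```

For the sample above, only the first link yields a value (http://example.com/a.jpg, decoded by parse_str); the other two are skipped. In real use, the string returned by curl_exec() would be passed in place of $sample.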
