How to get page content using cURL?


Question


I would like to scrape the content of this Google search result page using curl. I've been trying setting different user agents, and setting other options but I just can't seem to get the content of that page, as I often get redirected or I get a "page moved" error.

I believe it has something to do with the fact that the query string gets encoded somewhere but I'm really not sure how to get around that.

    //$url is the same as the link above
    $ch = curl_init();
    $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
    curl_setopt ($ch, CURLOPT_URL, $url);
    curl_setopt ($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt ($ch, CURLOPT_HEADER, 0);
    curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt ($ch,CURLOPT_CONNECTTIMEOUT,120);
    curl_setopt ($ch,CURLOPT_TIMEOUT,120);
    curl_setopt ($ch,CURLOPT_MAXREDIRS,10);
    curl_setopt ($ch,CURLOPT_COOKIEFILE,"cookie.txt");
    curl_setopt ($ch,CURLOPT_COOKIEJAR,"cookie.txt");
    echo curl_exec ($ch);

What do I need to do to get my PHP code to show the exact content of the page as I would see it in my browser? What am I missing? Can anyone point me in the right direction?

I've seen similar questions on SO, but none with an answer that could help me.

EDIT:

I tried to just open the link using the Selenium WebDriver, that gives the same results as cURL. I am still thinking that this has to do with the fact that there are special characters in the query string which are getting messed up somewhere in the process.
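If the query string really is the culprit, one way to rule that out is to build the URL from its parameters with `http_build_query()`, which percent-encodes every key and value. A minimal sketch (the `google.com/search` endpoint and the `q`/`hl` parameter values here are assumptions for illustration):

```php
<?php
// Hypothetical search parameters; substitute your actual query.
$params = array(
    'q'  => 'site:example.com "special chars" & more',
    'hl' => 'en',
);

// http_build_query() RFC 1738-encodes each key and value, so quotes,
// ampersands, and spaces in the query survive the trip intact.
$url = 'https://www.google.com/search?' . http_build_query($params);

echo $url;
// https://www.google.com/search?q=site%3Aexample.com+%22special+chars%22+%26+more&hl=en
```

Passing a URL built this way to `curl_init()` avoids hand-escaping special characters in the query string.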

Solution

Here's how:

    /**
     * Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
     * array containing the HTTP server response header fields and content.
     */
    function get_web_page( $url )
    {
        $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';

        $options = array(

            CURLOPT_CUSTOMREQUEST  =>"GET",        //set request type post or get
            CURLOPT_POST           =>false,        //set to GET
            CURLOPT_USERAGENT      => $user_agent, //set user agent
            CURLOPT_COOKIEFILE     =>"cookie.txt", //set cookie file
            CURLOPT_COOKIEJAR      =>"cookie.txt", //set cookie jar
            CURLOPT_RETURNTRANSFER => true,     // return web page
            CURLOPT_HEADER         => false,    // don't return headers
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // handle all encodings
            CURLOPT_AUTOREFERER    => true,     // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
            CURLOPT_TIMEOUT        => 120,      // timeout on response
            CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;
        return $header;
    }

Example

//Read a web page and check for errors:

$result = get_web_page( $url );

if ( $result['errno'] != 0 )
    ... error: bad url, timeout, redirect loop ...

if ( $result['http_code'] != 200 )
    ... error: no page, no permissions, no service ...

$page = $result['content'];
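The pseudocode above can be fleshed out into runnable checks. A minimal sketch, using a hypothetical `$result` array shaped like `get_web_page()`'s return value in place of a live request:

```php
<?php
// Hypothetical response, shaped like get_web_page()'s return value.
$result = array(
    'errno'     => 0,                 // curl_errno(): 0 means no transport error
    'errmsg'    => '',                // curl_error() message, if any
    'http_code' => 200,               // status code from curl_getinfo()
    'content'   => '<html>ok</html>',
);

if ($result['errno'] != 0) {
    die('cURL error: ' . $result['errmsg']);   // bad url, timeout, redirect loop
}
if ($result['http_code'] != 200) {
    die('HTTP error ' . $result['http_code']); // no page, no permissions, no service
}

$page = $result['content'];
echo $page; // <html>ok</html>
```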
