How to get page content using cURL?


Question

I would like to scrape the content of this Google search result page using cURL. I've tried setting different user agents and other options, but I just can't seem to get the content of that page: I often get redirected or get a "page moved" error.

I believe it has something to do with the query string getting encoded somewhere, but I'm really not sure how to get around that.

    // $url is the same as the link above
    $ch = curl_init();
    $user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
    curl_setopt($ch, CURLOPT_TIMEOUT, 120);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    echo curl_exec($ch);

What do I need to do to get my PHP code to show the exact content of the page as I would see it in my browser? What am I missing? Can anyone point me in the right direction?

I've seen similar questions on SO, but none with an answer that could help me.

I also tried simply opening the link with Selenium WebDriver, which gives the same results as cURL. I still think this has to do with special characters in the query string getting messed up somewhere in the process.
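
If the query string really is the culprit, one way to rule it out is to let PHP encode the parameters instead of pasting a hand-assembled URL. The following is only a minimal sketch: `build_search_url` is a hypothetical helper, and the base URL and parameter values are placeholders, not the link from the question.

```php
<?php
// Hypothetical helper (not from the original post): builds a URL whose
// query string is percent-encoded by PHP rather than assembled by hand.
function build_search_url(string $base, array $params): string
{
    // PHP_QUERY_RFC3986 encodes spaces as %20; the default (RFC 1738) uses "+".
    return $base . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Placeholder values for illustration only.
$url = build_search_url('https://www.google.com/search', [
    'q'  => 'curl "page moved" & redirects',
    'hl' => 'en',
]);
echo $url, PHP_EOL;
```

Printing the resulting URL and comparing it with the one shown in the browser's address bar may reveal whether anything is being double-encoded before it reaches cURL.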

Recommended answer

Here is how:

    /**
     * Get a web file (HTML, XHTML, XML, image, etc.) from a URL.  Return an
     * array containing the HTTP server response header fields and content.
     */
    function get_web_page( $url )
    {
        $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';

        $options = array(

            CURLOPT_CUSTOMREQUEST  =>"GET",        //set request type post or get
            CURLOPT_POST           =>false,        //set to GET
            CURLOPT_USERAGENT      => $user_agent, //set user agent
            CURLOPT_COOKIEFILE     =>"cookie.txt", //set cookie file
            CURLOPT_COOKIEJAR      =>"cookie.txt", //set cookie jar
            CURLOPT_RETURNTRANSFER => true,     // return web page
            CURLOPT_HEADER         => false,    // don't return headers
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // handle all encodings
            CURLOPT_AUTOREFERER    => true,     // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
            CURLOPT_TIMEOUT        => 120,      // timeout on response
            CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;
        return $header;
    }

Example

    // Read a web page and check for errors:

    $result = get_web_page( $url );

    if ( $result['errno'] != 0 ) {
        // error: bad url, timeout, redirect loop ...
    }

    if ( $result['http_code'] != 200 ) {
        // error: no page, no permissions, no service ...
    }

    $page = $result['content'];
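
Because `get_web_page()` returns everything from `curl_getinfo()`, the same array can also help diagnose the redirects mentioned in the question. The sketch below is an assumption-laden illustration: `describe_fetch` is a hypothetical helper, and the sample array contains made-up values shaped like the function's output.

```php
<?php
// Hypothetical helper: summarises the array returned by get_web_page().
function describe_fetch(array $result): string
{
    if (($result['errno'] ?? 0) != 0) {
        return 'cURL error ' . $result['errno'] . ': ' . ($result['errmsg'] ?? '');
    }
    $summary = 'HTTP ' . ($result['http_code'] ?? 0);
    // 'redirect_count' and 'url' (the last effective URL) come from curl_getinfo().
    $hops = $result['redirect_count'] ?? 0;
    if ($hops > 0) {
        $summary .= ' after ' . $hops . ' redirect(s), final URL: ' . ($result['url'] ?? '');
    }
    return $summary;
}

// Stand-in data for illustration; in practice pass get_web_page( $url ).
echo describe_fetch([
    'errno'          => 0,
    'http_code'      => 200,
    'redirect_count' => 2,
    'url'            => 'https://example.com/final',
]), PHP_EOL;
```

Seeing where the request finally lands can show whether the server is redirecting to a consent or error page rather than the expected results.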
