How to get page content using cURL?
Question
I would like to scrape the content of this Google search result page using cURL. I have tried setting different user agents and other options, but I just can't seem to get the content of that page: I am often redirected, or I get a "page moved" error.
I believe it has something to do with the fact that the query string gets encoded somewhere but I'm really not sure how to get around that.
//$url is the same as the link above
$ch = curl_init();
$user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
echo curl_exec($ch);
What do I need to do to get my PHP code to show the exact content of the page as I would see it in my browser? What am I missing? Can anyone point me in the right direction?
I've seen similar questions on SO, but none with an answer that could help me.
EDIT:
I tried to just open the link using the Selenium WebDriver, that gives the same results as cURL. I am still thinking that this has to do with the fact that there are special characters in the query string which are getting messed up somewhere in the process.
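If the query string really is getting mangled, one way to rule that out is to rebuild the URL with `http_build_query()`, which percent-encodes every parameter before the request is made. A minimal sketch (the search term here is a made-up example containing the kind of special characters that break when pasted into a URL verbatim):

```php
<?php
// Hypothetical search term with characters (:, ", &, spaces) that must
// be percent-encoded to survive in a query string.
$query = 'site:example.com "exact phrase" & more';

// http_build_query() encodes each parameter value, so the special
// characters reach the server intact instead of being misparsed.
$url = 'https://www.google.com/search?' . http_build_query(array('q' => $query));

echo $url, "\n";
```

This produces a URL whose `q` parameter is fully encoded (`:` becomes `%3A`, `"` becomes `%22`, `&` becomes `%26`), which can then be passed to `curl_init()` as usual.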
This is how:
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
function get_web_page( $url )
{
$user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
$options = array(
CURLOPT_CUSTOMREQUEST => "GET", // set request type: POST or GET
CURLOPT_POST => false, // set to GET
CURLOPT_USERAGENT => $user_agent, // set user agent
CURLOPT_COOKIEFILE => "cookie.txt", // set cookie file
CURLOPT_COOKIEJAR => "cookie.txt", // set cookie jar
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
return $header;
}
Example
//Read a web page and check for errors:
$result = get_web_page( $url );
if ( $result['errno'] != 0 )
... error: bad url, timeout, redirect loop ...
if ( $result['http_code'] != 200 )
... error: no page, no permissions, no service ...
$page = $result['content'];
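The error-checking pseudocode above, written out as runnable PHP. The `$result` array here is a stand-in with the same shape as `get_web_page()`'s return value, so the pattern can be run on its own:

```php
<?php
// Stand-in for get_web_page($url): curl_getinfo() fields plus the
// errno/errmsg/content keys the function adds.
$result = array(
    'errno'     => 0,                               // curl_errno() value
    'errmsg'    => '',                              // curl_error() value
    'http_code' => 200,                             // HTTP status code
    'content'   => '<html><body>ok</body></html>',  // response body
);

if ($result['errno'] != 0) {
    // cURL-level failure: bad URL, timeout, redirect loop...
    die('cURL error: ' . $result['errmsg']);
}
if ($result['http_code'] != 200) {
    // HTTP-level failure: no page, no permissions, no service...
    die('HTTP error: ' . $result['http_code']);
}
$page = $result['content'];
```

Checking `errno` first matters: when cURL itself fails, `content` is `false` and `http_code` is meaningless, so the HTTP check alone would misreport the cause.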