Even the cURL function can't scrape some URLs


Question

I'm using cURL to scrape the HTML from URLs. It works well for about 80% of the URLs I use, but some URLs don't seem to be "scrapeable". For example, when I try to scrape http://www.thefancy.com, it doesn't work: the request keeps loading and in the end returns no result. The problem can be reproduced at http://www.itemmized.com/test/test/. This is my code:

 if($_POST['submit']) {

 function curl_exec_follow($ch, &$maxredirect = null) {

 $mr = $maxredirect === null ? 5 : intval($maxredirect);

 if (ini_get('open_basedir') == '' && !ini_get('safe_mode')) {

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $mr > 0);
curl_setopt($ch, CURLOPT_MAXREDIRS, $mr);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

} else {

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

if ($mr > 0)
{
  $original_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
  $newurl = $original_url;

  $rch = curl_copy_handle($ch);

  curl_setopt($rch, CURLOPT_HEADER, true);
  curl_setopt($rch, CURLOPT_NOBODY, true);
  curl_setopt($rch, CURLOPT_FORBID_REUSE, false);
  do
  {
    curl_setopt($rch, CURLOPT_URL, $newurl);
    $header = curl_exec($rch);
    if (curl_errno($rch)) {
      $code = 0;
    } else {
      $code = curl_getinfo($rch, CURLINFO_HTTP_CODE);
      if ($code == 301 || $code == 302) {
        preg_match('/Location:(.*?)\n/', $header, $matches);
        $newurl = trim(array_pop($matches));

        // if no scheme is present then the new url is a
        // relative path and thus needs some extra care
        if(!preg_match("/^https?:/i", $newurl)){
          $newurl = $original_url . $newurl;
        }
      } else {
        $code = 0;
      }
    }
  } while ($code && --$mr);

  curl_close($rch);

  if (!$mr)
  {
    if ($maxredirect === null)
    trigger_error('Too many redirects.', E_USER_WARNING);
    else
    $maxredirect = 0;

    return false;
  }
  curl_setopt($ch, CURLOPT_URL, $newurl);
}
 }
return curl_exec($ch);
 }

 $ch = curl_init($_POST['form_url']);
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
 $data = curl_exec_follow($ch);
  curl_close($ch);


  echo $data;
 }


Recommended answer

Try this. Hope this helps.

<?php


class Curl
{       

public $cookieJar = "";
public $curl = null; // cURL handle, created in get() and postForm()

public function __construct($cookieJarFile = 'cookies.txt') {
    $this->cookieJar = $cookieJarFile;
}

function setup()
{


    $header = array();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] =  "Cache-Control: max-age=0";
    $header[] =  "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.


    curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
    curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($this->curl,CURLOPT_COOKIEJAR, $this->cookieJar);
    curl_setopt($this->curl,CURLOPT_COOKIEFILE, $this->cookieJar);
    curl_setopt($this->curl,CURLOPT_AUTOREFERER, true);
    curl_setopt($this->curl,CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($this->curl,CURLOPT_RETURNTRANSFER, true);  
}


function get($url)
{ 
    $this->curl = curl_init($url);
    $this->setup();

    return $this->request();
}

function getAll($reg,$str)
{
    preg_match_all($reg,$str,$matches);
    return $matches[1];
}

function postForm($url, $fields, $referer='')
{
    $this->curl = curl_init($url);
    $this->setup();
    curl_setopt($this->curl, CURLOPT_URL, $url);
    curl_setopt($this->curl, CURLOPT_POST, 1);
    curl_setopt($this->curl, CURLOPT_REFERER, $referer);
    curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
    return $this->request();
}

function getInfo($info)
{
    $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
    return $info;
}

function request()
{
    return curl_exec($this->curl);
}
}
$curl = new Curl();
$html = $curl->get("http://www.thefancy.com");
echo $html;



?>
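
Compared with the code in the question, the class above mainly adds a browser-like User-Agent, a fuller set of request headers, and a cookie jar. Neither version sets a timeout, so a server that stalls can still leave the request "loading" with no result. The following is a minimal sketch of one way to bound the request and surface the cURL error message. The names scrape_options and fetch_html, the User-Agent string, and the timeout values are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): options for a bounded,
// browser-like GET request.
function scrape_options(string $cookieFile, int $timeout = 15): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // let cURL follow redirects itself
        CURLOPT_MAXREDIRS      => 5,           // stop after a few hops
        CURLOPT_CONNECTTIMEOUT => 5,           // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout,    // seconds allowed for the whole transfer
        CURLOPT_ENCODING       => '',          // accept any compression cURL supports
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // assumed value
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here when the handle closes
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from the same file
    ];
}

// Hypothetical helper: fetch a URL and report the failure reason instead of
// silently returning nothing.
function fetch_html(string $url, string $cookieFile = 'cookies.txt', int $timeout = 15): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, scrape_options($cookieFile, $timeout));

    $body = curl_exec($ch);

    return [
        'body'   => $body === false ? null : $body,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no response arrived
        'error'  => curl_error($ch),                       // empty string on success
    ];
}
```

A call such as fetch_html('http://www.thefancy.com') then returns within the time limit either way. If 'body' is null, 'error' holds the cURL message (for example a timeout), which is usually the quickest way to see why a particular site cannot be scraped.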

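The manual redirect loop in the question builds the next URL by appending the Location value to the original URL, which only works for some relative paths. The sketch below shows one way to resolve a Location value against the URL that produced it. The function name resolve_redirect_url is hypothetical, and the sketch deliberately ignores "../" segments and query-only references, so it is a simplification rather than a full RFC 3986 resolver.

```php
<?php
// Hypothetical helper (not part of the original code): resolve a Location
// header value against the URL of the response that contained it.
function resolve_redirect_url(string $base, string $location): string
{
    // Already absolute: use it unchanged.
    if (preg_match('#^https?://#i', $location)) {
        return $location;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';

    // Scheme-relative reference such as //cdn.example.com/x
    if (strpos($location, '//') === 0) {
        return $scheme . ':' . $location;
    }

    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Root-relative path such as /login
    if (strpos($location, '/') === 0) {
        return $origin . $location;
    }

    // Path-relative reference: replace the last segment of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $location;
}
```

With this helper, the line that concatenates $original_url and $newurl could instead call resolve_redirect_url($original_url, $newurl), so that a redirect to "/login" from "http://example.com/a/b" becomes "http://example.com/login".
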