Get all the URLs using multi cURL


Problem description

I'm working on an app that gets all the URLs from an array of sites and displays them in array form or as JSON.

I can do it using a for loop; the problem is the execution time. When I tried 10 URLs, it gave me an error saying the maximum execution time was exceeded.

Upon searching, I found multi cURL.

I also found this: Fast PHP CURL Multiple Requests: Retrieve the content of multiple URLs using CURL. I tried to add my code, but it didn't work because I don't know how to use the function.

Hope you can help me.

Thanks.

Here is my sample code.

<?php

$urls=array(
'http://site1.com/',
'http://site2.com/',
'http://site3.com/');


$mh = curl_multi_init();
foreach ($urls as $i => $url) {

        $urlContent = file_get_contents($url);

        $dom = new DOMDocument();
        @$dom->loadHTML($urlContent);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");

        for($i = 0; $i < $hrefs->length; $i++){
            $href = $hrefs->item($i);
            $url = $href->getAttribute('href');
            $url = filter_var($url, FILTER_SANITIZE_URL);
            // validate url
            if(!filter_var($url, FILTER_VALIDATE_URL) === false){
                echo '<a href="'.$url.'">'.$url.'</a><br />';
            }
        }

        $conn[$i]=curl_init($url);
        $fp[$i]=fopen ($g, "w");
        curl_setopt ($conn[$i], CURLOPT_FILE, $fp[$i]);
        curl_setopt ($conn[$i], CURLOPT_HEADER ,0);
        curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,60);
        curl_multi_add_handle ($mh,$conn[$i]);
}
do {
    $n=curl_multi_exec($mh,$active);
}
while ($active);
foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh,$conn[$i]);
    curl_close($conn[$i]);
    fclose ($fp[$i]);
}
curl_multi_close($mh);
?>


Recommended answer

Here is a function that I put together that will properly utilize the curl_multi_init() function. It is more or less the same function that you will find on PHP.net, with some minor tweaks. I have had great success with this.

function multi_thread_curl($urlArray, $optionArray, $nThreads) {

  //Group your urls into groups/threads.
  $curlArray = array_chunk($urlArray, $nThreads, true);

  //Collected responses, keyed by the original url array keys.
  $results = array();

  //Iterate through each batch of urls.
  $ch = 'ch_';
  foreach ($curlArray as $threads) {

      //Create your cURL resources.
      foreach ($threads as $thread => $value) {

          ${$ch . $thread} = curl_init();

          curl_setopt_array(${$ch . $thread}, $optionArray); //Set your main curl options.
          curl_setopt(${$ch . $thread}, CURLOPT_URL, $value); //Set url.

      }

      //Create the multiple cURL handler.
      $mh = curl_multi_init();

      //Add the handles.
      foreach ($threads as $thread => $value) {

          curl_multi_add_handle($mh, ${$ch . $thread});

      }

      $active = null;

      //Execute the handles.
      do {

          $mrc = curl_multi_exec($mh, $active);

      } while ($mrc == CURLM_CALL_MULTI_PERFORM);

      while ($active && $mrc == CURLM_OK) {

          if (curl_multi_select($mh) != -1) {
              do {

                  $mrc = curl_multi_exec($mh, $active);

              } while ($mrc == CURLM_CALL_MULTI_PERFORM);
          }

      }

      //Get your data and close the handles.
      foreach ($threads as $thread => $value) {

          $results[$thread] = curl_multi_getcontent(${$ch . $thread});

          curl_multi_remove_handle($mh, ${$ch . $thread});

      }

      //Close the multi handle exec.
      curl_multi_close($mh);

  }

  return $results;

}



//Add whatever options here. The CURLOPT_URL is left out intentionally.
//It will be added in later from the url array.
$optionArray = array(

  CURLOPT_USERAGENT        => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',//Pick your user agent.
  CURLOPT_RETURNTRANSFER   => TRUE,
  CURLOPT_TIMEOUT          => 10

);

//Create an array of your urls.
$urlArray = array(

    'http://site1.com/',
    'http://site2.com/',
    'http://site3.com/'

);

//Play around with this number and see what works best.
//This is how many urls it will try to do at one time.
$nThreads = 20;

//To use run the function.
$results = multi_thread_curl($urlArray, $optionArray, $nThreads);

Once this is complete, you will have an array containing all of the HTML from your list of websites. At this point I would loop through them and pull out all of the URLs.

Like this:

foreach ($results as $page) {

  $dom = new DOMDocument();
  @$dom->loadHTML($page);
  $xpath = new DOMXPath($dom);
  $hrefs = $xpath->evaluate("/html/body//a");

  for ($i = 0; $i < $hrefs->length; $i++) {

    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);
    // validate url
    if (!filter_var($url, FILTER_VALIDATE_URL) === false) {
      echo '<a href="'.$url.'">'.$url.'</a><br />';
    }

  }

}
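
Since the goal was to show the URLs in array form or as JSON, here is a minimal sketch of the same extraction loop that collects the links into an array and encodes it, rather than echoing anchor tags. The $allLinks variable and grouping links by the originating URL are illustrative choices, not part of the original answer.

//Illustrative sketch: collect the extracted links instead of echoing them.
$allLinks = array();

foreach ($results as $key => $page) {

  $dom = new DOMDocument();
  @$dom->loadHTML($page);
  $xpath = new DOMXPath($dom);
  $hrefs = $xpath->evaluate("/html/body//a");

  for ($i = 0; $i < $hrefs->length; $i++) {
    $url = filter_var($hrefs->item($i)->getAttribute('href'), FILTER_SANITIZE_URL);
    if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
      //Group each link under the page it was found on ($urlArray keys are preserved by the function).
      $allLinks[$urlArray[$key]][] = $url;
    }
  }

}

//Output the whole structure as JSON.
echo json_encode($allLinks, JSON_PRETTY_PRINT);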

It is also worth keeping in the back of your head the ability to increase the run time of your script.

If you're using a hosting service, you may be restricted to something in the ballpark of two minutes regardless of what you set your max execution time to. Just food for thought.

This is done by:

ini_set('max_execution_time',120);

You can always try more time, but you'll never know until you time it.
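
One simple way to actually time it, as a rough sketch rather than part of the original answer, is to record microtime(true) at the top of the script and print the elapsed seconds at the end; the $startTime name is just illustrative.

//Record the start time at the very top of the script.
$startTime = microtime(true);

//Fetch and parse the urls as above.
$results = multi_thread_curl($urlArray, $optionArray, $nThreads);

//Report how long the whole run took, in seconds.
echo 'Finished in ' . round(microtime(true) - $startTime, 2) . ' seconds';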

Hope it helps.
