Which performs faster, headless browser or Curl?
Question
I need to open around 100,000 URLs per day so that the images and HTML are cached in Cloudflare, because the content changes fairly frequently.
I suspect that curl will probably perform faster than a headless browser (headless Chrome via Puppeteer).
Does anyone have experience with this, or are there better ways of doing it?
Answer
First off, I am confident that libcurl's curl_multi API is significantly faster than a headless browser. Even running under PHP (a much slower language than, say, C), I reckon it will still beat a headless browser. But let's put it to the test, benchmarking it using the code from https://stackoverflow.com/a/54353191/1067003 .
Benchmark this PHP script (it uses PHP's curl_multi API, which is a wrapper around libcurl's curl_multi API):
<?php
declare(strict_types=1);
$urls=array();
for($i=0;$i<100000;++$i){
$urls[]="http://ratma.net/";
}
validate_urls($urls,500,1000,false,false,false);
// if return_fault_reason is false, then the return is a simple array of strings of urls that validated.
// otherwise it's an array with the url as the key containing array(bool validated,int curl_error_code,string reason) for every url
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = false) : array
{
if ($max_connections < 1) {
throw new InvalidArgumentException("max_connections MUST be >=1");
}
foreach ($urls as $key => $foo) {
if (!is_string($foo)) {
throw new InvalidArgumentException("all urls must be strings!");
}
if (empty($foo)) {
unset($urls[$key]); //?
}
}
unset($foo);
// DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
$ret = array();
$mh = curl_multi_init();
$workers = array();
$work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason, &$consider_http_300_redirect_as_error) {
// > If an added handle fails very quickly, it may never be counted as a running_handle
while (1) {
curl_multi_exec($mh, $still_running);
if ($still_running < count($workers)) {
break;
}
$cms=curl_multi_select($mh, 10);
//var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
}
while (false !== ($info = curl_multi_info_read($mh))) {
//echo "NOT FALSE!";
//var_dump($info);
{
if ($info['msg'] !== CURLMSG_DONE) {
continue;
}
if ($info['result'] !== CURLM_OK) {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
}
} elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
}
} else {
$code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
if ($code[0] === "3") {
if ($consider_http_300_redirect_as_error) {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
}
} else {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
} else {
$ret[] = $workers[(int)$info['handle']];
}
}
} elseif ($code[0] === "2") {
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
} else {
$ret[] = $workers[(int)$info['handle']];
}
} else {
// all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etcetc)
if ($return_fault_reason) {
$ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
}
}
}
curl_multi_remove_handle($mh, $info['handle']);
assert(isset($workers[(int)$info['handle']]));
unset($workers[(int)$info['handle']]);
curl_close($info['handle']);
}
}
//echo "NO MORE INFO!";
};
foreach ($urls as $url) {
while (count($workers) >= $max_connections) {
//echo "TOO MANY WORKERS!\n";
$work();
}
$neww = curl_init($url);
if (!$neww) {
trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
if ($return_fault_reason) {
$ret[$url] = array(false, -1, "curl_init() failed");
}
continue;
}
$workers[(int)$neww] = $url;
curl_setopt_array($neww, array(
CURLOPT_NOBODY => 1,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => 0,
CURLOPT_TIMEOUT_MS => $timeout_ms
));
curl_multi_add_handle($mh, $neww);
//curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
}
while (count($workers) > 0) {
//echo "WAITING FOR WORKERS TO BECOME 0!";
//var_dump(count($workers));
$work();
}
curl_multi_close($mh);
return $ret;
}
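For reference, here is a small post-processing sketch for the fault-reason mode: when return_fault_reason is true, the function returns url => array(bool validated, int curl_error_code, string reason), as described in the comments above. The $results map below is hardcoded sample data in that shape (not real output), just to show how you might split it into successes and failures:

```php
<?php
// Hypothetical sample in the validate_urls() return_fault_reason=true
// shape: url => array(bool ok, int curl_error_code, string reason).
$results = array(
    "http://ratma.net/"       => array(true, 0, "got a http 200 code, which is considered a success"),
    "http://example.invalid/" => array(false, 6, "curl error 6: Couldn't resolve host name"),
);
// Partition into validated URLs and url => reason for the failures.
$ok = $failed = array();
foreach ($results as $url => $info) {
    if ($info[0]) {
        $ok[] = $url;
    } else {
        $failed[$url] = $info[2];
    }
}
echo count($ok), " ok, ", count($failed), " failed\n"; // prints "1 ok, 1 failed"
```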
And benchmark it against doing the same in a headless browser. I dare you.
For the record, ratma.net is hosted in Canada, and this benchmark ran from another datacenter, also in Canada:
foo@foo:/srv/http/default/www# time php foo.php
real 0m32.606s
user 0m19.561s
sys 0m12.991s
It completed 100,000 requests in 32.6 seconds, which works out to roughly 3067 requests per second. I haven't actually measured it, but I expect a headless browser to perform significantly worse than that.
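The throughput figure follows directly from the wall-clock time reported by `time`:

```php
<?php
// Derive requests-per-second from the benchmark's "real" time.
$requests     = 100000;
$wall_seconds = 32.6;
$rps = (int) floor($requests / $wall_seconds);
echo $rps, "\n"; // prints "3067"
```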
(PS: note that this script does not download the entire content; it issues an HTTP HEAD request instead of an HTTP GET request. If you want it to download the entire content, replace CURLOPT_NOBODY => 1
with CURLOPT_WRITEFUNCTION => function($ch, string $data){ return strlen($data); }
)
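Concretely, the full-download variant of the handle setup might look like this. This is a sketch, reusing the test URL and the $timeout_ms value from the script above; the write callback consumes and discards the body, which forces libcurl to actually transfer the whole response:

```php
<?php
// GET-variant handle setup: CURLOPT_NOBODY is dropped and replaced
// with a write callback that discards the body after counting it.
$timeout_ms = 10000;
$ch = curl_init("http://ratma.net/");
curl_setopt_array($ch, array(
    CURLOPT_SSL_VERIFYHOST => 0,
    CURLOPT_SSL_VERIFYPEER => 0,
    CURLOPT_TIMEOUT_MS     => $timeout_ms,
    // returning strlen($data) tells libcurl the chunk was fully consumed
    CURLOPT_WRITEFUNCTION  => function ($ch, string $data): int {
        return strlen($data);
    },
));
// ...then hand $ch to curl_multi_add_handle() exactly as before.
```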