Which performs faster, headless browser or Curl?


Problem Description

I need to open around 100,000 URLs per day so that the images and HTML are cached in Cloudflare, as the content changes fairly frequently.

I suspect Curl will probably perform faster than a headless browser.

Does anyone have any experience with this, or are there better ways of doing it?

Recommended Answer

First off, I am confident that libcurl's curl_multi API is significantly faster than a headless browser. Even running under PHP (which is a much slower language than, say, C), I reckon it would be faster than a headless browser. But let's put it to the test, benchmarking it using the code from https://stackoverflow.com/a/54353191/1067003.

Benchmark this PHP script (it uses PHP's curl_multi API, which is a wrapper around libcurl's curl_multi API):

<?php
declare(strict_types=1);
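// NB: written for PHP 7, where curl handles are resources and (int)$handle yields
// the resource id used as an array key; on PHP 8+, curl_init() returns a CurlHandle
// object, so replace the (int) casts below with spl_object_id($handle).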
$urls=array();
for($i=0;$i<100000;++$i){
    $urls[]="http://ratma.net/";
}
validate_urls($urls, 500, 1000, false, false);
// if return_fault_reason is false, then the return is a simple array of strings of urls that validated.
// otherwise it's an array with the url as the key containing  array(bool validated,int curl_error_code,string reason) for every url
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = false) : array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (!is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); //?
        }
    }
    unset($foo);
    // DISABLED for benchmarking purposes: $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason, &$consider_http_300_redirect_as_error) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            curl_multi_exec($mh, $still_running);
            if ($still_running < count($workers)) {
                break;
            }
            $cms=curl_multi_select($mh, 10);
            //var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            //echo "NOT FALSE!";
            //var_dump($info);
            {
                if ($info['msg'] !== CURLMSG_DONE) {
                    continue;
                }
                if ($info['result'] !== CURLE_OK) { // 'result' is a CURLE_* transfer code
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                    }
                } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                    }
                } else {
                    $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                    if ($code[0] === "3") {
                        if ($consider_http_300_redirect_as_error) {
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                            }
                        } else {
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                            } else {
                                $ret[] = $workers[(int)$info['handle']];
                            }
                        }
                    } elseif ($code[0] === "2") {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                        } else {
                            $ret[] = $workers[(int)$info['handle']];
                        }
                    } else {
                        // all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etcetc)
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                        }
                    }
                }
                curl_multi_remove_handle($mh, $info['handle']);
                assert(isset($workers[(int)$info['handle']]));
                unset($workers[(int)$info['handle']]);
                curl_close($info['handle']);
            }
        }
        //echo "NO MORE INFO!";
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            //echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (!$neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(false, -1, "curl_init() failed");
            }
            continue;
        }
        $workers[(int)$neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_NOBODY => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        //echo "WAITING FOR WORKERS TO BECOME 0!";
        //var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}
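
As an aside (not part of the original answer), here is a minimal usage sketch; the URL list and the report handling are illustrative, assuming you call the function above with return_fault_reason = true to get the detailed per-URL report:

// hypothetical example: 50 concurrent connections, 10-second timeout,
// treat 3xx redirects as success, and request the detailed report
$report = validate_urls(array("http://ratma.net/", "http://example.com/"), 50, 10000, false, true);
foreach ($report as $url => $result) {
    // each entry is array(bool validated, int curl_error_code, string reason)
    list($ok, $code, $reason) = $result;
    echo ($ok ? "OK  " : "FAIL") . " " . $url . " - " . $reason . "\n";
}

With return_fault_reason left at false, the return is instead a flat array of the URLs that validated, as the comment above the function notes.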

And benchmark it against doing the same in a headless browser; I dare you.

For the record, ratma.net is hosted in Canada, and this run was made from another datacenter, also in Canada:

foo@foo:/srv/http/default/www# time php foo.php

real    0m32.606s
user    0m19.561s
sys     0m12.991s

It completed 100,000 requests in 32.6 seconds, which works out to roughly 3067 requests per second. I haven't actually checked, but I expect a headless browser to perform significantly worse than that.

(PS: note that this script does not download the entire content; it issues an HTTP HEAD request instead of an HTTP GET request. If you want it to download the entire content, replace CURLOPT_NOBODY => 1 with CURLOPT_WRITEFUNCTION => function($ch, string $data) { return strlen($data); }.)
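
Concretely, the curl_setopt_array() call in the script would then look something like this (a sketch of the substitution just described; everything else in the script stays the same):

curl_setopt_array($neww, array(
    // fetch the full body with a GET, but discard the bytes as they arrive;
    // returning the number of bytes handled tells curl the write succeeded
    CURLOPT_WRITEFUNCTION => function ($ch, string $data): int {
        return strlen($data);
    },
    CURLOPT_SSL_VERIFYHOST => 0,
    CURLOPT_SSL_VERIFYPEER => 0,
    CURLOPT_TIMEOUT_MS => $timeout_ms
));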

