Quickly validate a large list of URLs in PHP?


Problem description

I have a database of content with free text in it. There are about 11000 rows of data, and each row has 87 columns. There are thus (potentially) around 957000 fields to check for valid URLs.

I wrote a regular expression to extract everything that looks like a URL (http/s, etc.) and built up an array called $urls. I then loop through it, passing each $url to my curl_exec() call.
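(A minimal sketch of that extraction step; the exact pattern and the $rows result set here are simplified placeholders, not my actual code:)

$urls = array();
foreach ($rows as $row) { // $rows: the fetched database rows, each with many text columns
    foreach ($row as $field) {
        if (is_string($field) && preg_match_all('~https?://[^\s"\'<>]+~i', $field, $matches)) {
            $urls = array_merge($urls, $matches[0]);
        }
    }
}
$urls = array_unique($urls); // each distinct URL only needs to be checked once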

I have tried cURL (for each $url):

$ch = curl_init(); // one shared handle, reconfigured per URL
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 250); // give up on slow connections quickly
curl_setopt($ch, CURLOPT_NOBODY, 1);              // headers only, skip the body
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECT_ONLY, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTPGET, 1);
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $exec = curl_exec($ch);
    // Extra stuff here... it does add overhead, but not that much.
}
curl_close($ch);

As far as I can tell, this SHOULD work and be as fast as I can go, but it takes around 2-3 seconds per URL.

Is there a faster way of doing this?

I am planning on running this via a cron job, checking my local database first to see whether a URL has already been checked in the last 30 days and only re-checking it if not, so over time the workload will shrink. But I just want to know whether cURL is the best solution, and whether I am missing something that would make it faster.
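(A rough sketch of that caching idea, assuming a hypothetical MySQL-style url_checks table with url, last_checked and valid columns, and an existing PDO connection in $pdo; all names are illustrative only:)

$stmt = $pdo->prepare(
    'SELECT valid FROM url_checks
     WHERE url = :url AND last_checked > DATE_SUB(NOW(), INTERVAL 30 DAY)'
);
$toCheck = array();
foreach ($urls as $url) {
    $stmt->execute(array(':url' => $url));
    if ($stmt->fetchColumn() === false) {
        $toCheck[] = $url; // no result in the last 30 days, needs re-checking
    }
}
// only $toCheck has to go through cURL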

EDIT:
Based on the comment by Nick Zulu below, I sit with this code now:

function ODB_check_url_array($urls, $debug = true) {
  if (!empty($urls)) {
    $mh = curl_multi_init();
    $ch = array();
    foreach ($urls as $index => $url) {
      $ch[$index] = curl_init($url);
      curl_setopt($ch[$index], CURLOPT_CONNECTTIMEOUT_MS, 10000);
      curl_setopt($ch[$index], CURLOPT_NOBODY, 1);
      curl_setopt($ch[$index], CURLOPT_FAILONERROR, 1);
      curl_setopt($ch[$index], CURLOPT_RETURNTRANSFER, 1);
      curl_setopt($ch[$index], CURLOPT_CONNECT_ONLY, 1);
      curl_setopt($ch[$index], CURLOPT_HEADER, 1);
      curl_setopt($ch[$index], CURLOPT_HTTPGET, 1);
      curl_multi_add_handle($mh, $ch[$index]);
    }
    $running = null;
    do {
      curl_multi_exec($mh, $running);
      curl_multi_select($mh); // wait for activity instead of busy-looping
    } while ($running);
    $return = array();
    foreach ($ch as $index => $handle) {
      // a curl handle cannot be used as an array key, so key by $index instead
      $return[$index] = curl_multi_getcontent($handle);
      curl_multi_remove_handle($mh, $handle);
      curl_close($handle);
    }
    curl_multi_close($mh);
    return $return;
  }
}


Answer

Let's:

  • use the curl_multi api (it's the only sane choice for doing this in PHP)

  • have a max simultaneous connection limit; don't just create a connection for each URL (you'll get out-of-memory or out-of-resource errors if you create a million simultaneous connections, and I wouldn't even trust the timeout errors if you did)

  • only fetch the headers, because downloading the body would be a waste of time and bandwidth

Here is my attempt:

// if return_fault_reason is false, then the return is a simple array of strings of urls that validated.
// otherwise it's an array with the url as the key containing  array(bool validated,int curl_error_code,string reason) for every url
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = true) : array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (!is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); //?
        }
    }
    unset($foo);
    $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason, $consider_http_300_redirect_as_error) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            curl_multi_exec($mh, $still_running);
            if ($still_running < count($workers)) {
                break;
            }
            curl_multi_select($mh, 10);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            if ($info['result'] !== CURLE_OK) {
                if ($return_fault_reason) {
                    $ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                }
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                if ($return_fault_reason) {
                    $ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                }
            } else {
                $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                if ($code[0] === "3") {
                    if ($consider_http_300_redirect_as_error) {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                        }
                    } else {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                        } else {
                            $ret[] = $workers[(int)$info['handle']];
                        }
                    }
                } elseif ($code[0] === "2") {
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                    } else {
                        $ret[] = $workers[(int)$info['handle']];
                    }
                } else {
                    // all non-2xx and non-3xx codes are always considered errors (500 internal server error, 400 client error, 404 not found, etc.)
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                    }
                }
            }
            curl_multi_remove_handle($mh, $info['handle']);
            assert(isset($workers[(int)$info['handle']]));
            unset($workers[(int)$info['handle']]);
            curl_close($info['handle']);
        }
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            $work(); // process finished transfers until a slot frees up
        }
        $neww = curl_init($url);
        if (!$neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(false, -1, "curl_init() failed");
            }
            continue;
        }
        $workers[(int)$neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_NOBODY => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        $work(); // drain the remaining workers
    }
    curl_multi_close($mh);
    return $ret;
}

And here is some test code:

$urls = [
    'www.example.org',
    'www.google.com',
    'https://www.google.com',
];
var_dump(validate_urls($urls, 1000, 1, true, false));

returns

array(0) {
}

because they all timed out (1 millisecond timeout), and fault reason reporting was disabled (that's the last argument). With reporting enabled:

$urls = [
    'www.example.org',
    'www.google.com',
    'https://www.google.com',
];
var_dump(validate_urls($urls, 1000, 1, true, true));

returns

array(3) {
  ["www.example.org"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
  ["www.google.com"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
  ["https://www.google.com"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
}

Increasing the timeout limit to 1000, we get

var_dump(validate_urls($urls, 1000, 1000, true, false));

returns

array(3) {
  [0]=>
  string(14) "www.google.com"
  [1]=>
  string(22) "https://www.google.com"
  [2]=>
  string(15) "www.example.org"
}

var_dump(validate_urls($urls, 1000, 1000, true, true));

returns

array(3) {
  ["www.google.com"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
  ["www.example.org"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
  ["https://www.google.com"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
}

And so on :) The speed should depend on your bandwidth and the $max_connections variable, which is configurable.
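(For the asker's batch scenario, a hedged usage sketch, reusing the hypothetical $toCheck list from the caching sketch in the question; the connection cap and timeout here are guesses to tune, not recommendations from this answer:)

// 500 simultaneous connections, 10 second timeout, redirects treated as errors,
// plain list of validated URLs returned:
$valid  = validate_urls($toCheck, 500, 10000, true, false);
$broken = array_diff($toCheck, $valid); // URLs that did not validate
// $valid / $broken can now be written back to the local cache table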
