php - Fastest way to check presence of text in many domains (above 1000)
Question
I have a PHP script that uses cURL to retrieve the content of web pages, and I would like to check each page for the presence of some text.
Right now it looks like this:
$timeout = 10;
$matches = array(); // per-target results, kept separate from preg_match's $match
for ($i = 0; $i < $num_target; $i++) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target[$i]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $content = curl_exec($ch); // page body, or false on failure
    curl_close($ch);
    // $text must be a complete regular expression, including delimiters
    if ($content !== false && preg_match($text, $content, $match)) {
        $matches[$i] = $match;
        echo "text " . $text . " found in URL: " . $target[$i] . ": " . $match[0] . "\n";
    } else {
        $matches[$i] = array();
        echo "text " . $text . " not found in URL: " . $target[$i] . ": no match\n";
    }
}
I was wondering whether a particular cURL setup could make this faster. I looked through the PHP manual and chose the options that seemed best to me, but I may have overlooked some that would improve the script's speed and performance.
I was also wondering whether using CGI, Perl, or Python (or another solution) could be faster than PHP.
Thank you in advance for any help, advice, or suggestions.
Recommended answer
You can use curl_multi_init, which allows the processing of multiple cURL handles in parallel.
Example
$url = array();
$url[] = 'http://www.huffingtonpost.com';
$url[] = 'http://www.yahoo.com';
$url[] = 'http://www.google.com';
$url[] = 'http://technet.microsoft.com/en-us/';
$start = microtime(true);
echo "<pre>";
print_r(checkLinks($url, "Azure"));
echo "<h1>", microtime(true) - $start, "</h1>";
Output
Array
(
[0] => http://technet.microsoft.com/en-us/
)
1.2735739707947 <-- Faster
Function used
function checkLinks($nodes, $text) {
$mh = curl_multi_init();
$curl_array = array();
foreach ( $nodes as $i => $url ) {
$curl_array[$i] = curl_init($url);
curl_setopt($curl_array[$i], CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_array[$i], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)');
curl_setopt($curl_array[$i], CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($curl_array[$i], CURLOPT_TIMEOUT, 15);
curl_multi_add_handle($mh, $curl_array[$i]);
}
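$running = null;
do {
curl_multi_exec($mh, $running);
// Wait for activity on any handle instead of polling on a fixed sleep.
if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
usleep(10000); // select can fail immediately on some systems; avoid a busy loop
}
} while ( $running > 0 );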
$res = array();
foreach ( $nodes as $i => $url ) {
$curlErrorCode = curl_errno($curl_array[$i]);
if ($curlErrorCode === 0) {
$info = curl_getinfo($curl_array[$i]);
if ($info['http_code'] == 200) {
if (stripos(curl_multi_getcontent($curl_array[$i]), $text) !== false) {
$res[] = $info['url'];
}
}
}
curl_multi_remove_handle($mh, $curl_array[$i]);
curl_close($curl_array[$i]);
}
curl_multi_close($mh);
return $res;
}
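With well over 1000 targets, handing every URL to a single multi handle at once opens that many connections at the same time, which may run into socket or memory limits. One possible approach, sketched below, is to process the list in fixed-size batches. The function name `checkLinksInBatches`, the `$batchSize` parameter, and the `$checker` callback are illustrative assumptions and not part of the original answer; in practice `$checker` would be the `checkLinks` function above. A stand-in checker is used here so the sketch runs without network access.

```php
<?php
// Hypothetical helper (not from the original answer): runs a checker such as
// checkLinks() over a large URL list in fixed-size batches, so that at most
// $batchSize transfers are in flight at any one time.
function checkLinksInBatches(array $urls, string $text, int $batchSize, callable $checker): array
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }
    $found = array();
    foreach (array_chunk($urls, $batchSize) as $batch) {
        // Each call processes one batch and returns the URLs that matched.
        $found = array_merge($found, $checker($batch, $text));
    }
    return $found;
}

// Stand-in checker for demonstration only: it "matches" when the URL itself
// contains the text (case-insensitively), so no HTTP requests are made.
$fakeChecker = function (array $batch, string $text): array {
    return array_values(array_filter($batch, function ($url) use ($text) {
        return stripos($url, $text) !== false;
    }));
};

$urls = array('http://a.example/azure', 'http://b.example/', 'http://c.example/AZURE');
$result = checkLinksInBatches($urls, 'azure', 2, $fakeChecker);
print_r($result);
```

Passing `'checkLinks'` as `$checker` would keep at most `$batchSize` connections open at a time. A suitable batch size depends on the host's limits and would need tuning. Note that each batch waits for its slowest URL before the next one starts, so a rolling window that adds a new handle as each one finishes could be faster, at the cost of more bookkeeping.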