Best way to go about reading a website?
Question
I'm trying to create a program that grabs data from a website x amount of times and I'm looking for a way to go about doing so without huge delays in the process.
Currently I use the following code, and it's rather slow (even though it is only grabbing 4 people's names; I'm expecting to do about 100 at a time):
$skills = array(
    "overall", "attack", "defense", "strength", "constitution", "ranged",
    "prayer", "magic", "cooking", "woodcutting", "fletching", "fishing",
    "firemaking", "crafting", "smithing", "mining", "herblore", "agility",
    "thieving", "slayer", "farming", "runecrafting", "hunter", "construction",
    "summoning", "dungeoneering"
);
$participants = array("Zezima", "Allar", "Foot", "Arma150", "Green098", "Skiller 703", "Quuxx"); // explode("\r\n", $_POST['names']);
// array_search() takes the needle first; fall back to 0 ("overall") if the skill is missing or unknown
$skill = isset($_GET['skill']) ? array_search($_GET['skill'], $skills) : 0;
if ($skill === false) {
    $skill = 0;
}
display($participants, $skills, $skill);
function getAllStats($participants) {
    $stats = array();
    for ($i = 0; $i < count($participants); $i++) {
        $stats[] = getStats($participants[$i]);
    }
    return $stats;
}
function display($participants, $skills, $stat) {
    $all = getAllStats($participants);
    for ($i = 0; $i < count($participants); $i++) {
        // Each line holds three fields (indexes 0 to 2): rank, level, experience
        $rank = getSkillData($all[$i], 0, $stat);
        $level = getSkillData($all[$i], 1, $stat);
        $experience = getSkillData($all[$i], 2, $stat);
    }
}
function getStats($username) {
    $timeout = 5; // seconds; assumed value, $timeout was undefined in the original
    $curl = curl_init("http://hiscore.runescape.com/index_lite.ws?player=" . urlencode($username));
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4, 5)));
    curl_setopt($curl, CURLOPT_HEADER, 0); // $header was undefined; response headers are not needed
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_VERBOSE, 1);
    $output = curl_exec($curl);
    curl_close($curl);
    // An HTML page instead of plain text means the lookup failed
    if ($output === false || strstr($output, "<html><head><title>")) {
        return false;
    }
    return $output;
}
function getSkillData($stats, $row, $skill) {
    $stats = explode("\n", $stats);
    $levels = explode(",", $stats[$skill]);
    return $levels[$row];
}
When I benchmarked this it took about 5 seconds, which isn't too bad, but imagine if I was doing this 93 more times. I understand it won't be instant, but I'd like to shoot for under 30 seconds. I know it's possible because I've seen websites which do something similar and they act within a 30 second time period.
I've read about caching the data, but that won't work because, simply, it will be stale. I'm using a database (further on; I haven't gotten to that part yet) to store old data and retrieve new data, which will be real time (what you see below).
Is there a way to achieve doing something like this without massive delays (and possibly overloading the server I am reading from)?
P.S: The website I am reading from is just text, it doesn't have any HTML to parse, which should reduce the loading time. Here's an example of what a page looks like (they're all the same, just different numbers):
69,2496,1285458634
10982,99,33055154
6608,99,30955066
6978,99,40342518
12092,99,36496288
13247,99,21606979
2812,99,13977759
926,99,36988378
415,99,153324269
329,99,59553081
472,99,40595060
2703,99,28297122
281,99,36937100
1017,99,19418910
276,99,27539259
792,99,34289312
3040,99,16675156
82,99,39712827
80,99,104504543
2386,99,21236188
655,99,28714439
852,99,30069730
29,99,200000000
3366,99,15332729
2216,99,15836767
154,120,200000000
-1,-1
-1,-1
-1,-1
-1,-1
-1,-1
30086,2183
54640,1225
89164,1028
123432,1455
-1,-1
-1,-1
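Since each response is plain text, it can be parsed once into a keyed array instead of being re-split on every lookup. The following is a minimal sketch, not the asker's code: `parseStats()` is a hypothetical helper, and it assumes one comma-separated `rank,level,experience` line per skill, in the same order as the `$skills` array from the question.

```php
<?php
// Hypothetical helper: turn an index_lite.ws response into rows keyed by skill name.
// Assumes the response has one "rank,level,experience" line per skill, in $skills order.
function parseStats(string $body, array $skills): array
{
    // Split on any newline style.
    $lines = preg_split('/\R/', trim($body));
    $parsed = array();
    foreach ($skills as $i => $name) {
        if (!isset($lines[$i])) {
            break; // response is shorter than the skill list
        }
        $fields = explode(',', $lines[$i]);
        if (count($fields) !== 3) {
            continue; // not a rank,level,experience row
        }
        $parsed[$name] = array(
            'rank'       => (int) $fields[0],
            'level'      => (int) $fields[1],
            'experience' => (int) $fields[2],
        );
    }
    return $parsed;
}

// First three lines of the sample page shown above.
$sample = "69,2496,1285458634\n10982,99,33055154\n6608,99,30955066";
$stats = parseStats($sample, array('overall', 'attack', 'defense'));
echo $stats['attack']['level'], "\n";
```

Parsing once per player keeps the per-skill lookups cheap, although the network round trips, not the parsing, are likely to dominate the total time.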
My previous benchmark of this method vs. curl_multi_exec:
function getTime() {
    $timer = explode(' ', microtime());
    $timer = $timer[1] + $timer[0];
    return $timer;
}
function benchmarkFunctions() {
    $start = getTime();
    old_f();
    $end = getTime();
    echo 'function old_f() took ' . round($end - $start, 4) . ' seconds to complete<br><br>';
    $startt = getTime();
    new_f();
    $endd = getTime();
    echo 'function new_f() took ' . round($endd - $startt, 4) . ' seconds to complete';
}
function old_f() {
    $test = array("A E T", "Ts Danne", "Funkymunky11", "Fast993", "Fast99Three", "Jeba", "Quuxx");
    getAllStats($test);
}
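function new_f() {
    $test = array("A E T", "Ts Danne", "Funkymunky11", "Fast993", "Fast99Three", "Jeba", "Quuxx");
    $curl_arr = array();
    $master = curl_multi_init();
    $amt = count($test);
    for ($i = 0; $i < $amt; $i++) {
        $curl_arr[$i] = curl_init('http://hiscore.runescape.com/index_lite.ws?player=' . urlencode($test[$i]));
        curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($master, $curl_arr[$i]);
    }
    do {
        curl_multi_exec($master, $running);
        if ($running > 0) {
            curl_multi_select($master); // wait for activity instead of spinning
        }
    } while ($running > 0);
    $results = array();
    for ($i = 0; $i < $amt; $i++) {
        // Read the body that was already downloaded; calling curl_exec() here
        // would fetch every page a second time, one after another
        $results[$i] = curl_multi_getcontent($curl_arr[$i]);
        curl_multi_remove_handle($master, $curl_arr[$i]);
    }
    curl_multi_close($master);
}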
Recommended answer
You can reuse curl connections. Also, I changed your code to check the httpCode instead of using strstr, which should be quicker.
Also, you can set up curl to run the requests in parallel, which I've never tried. See http://www.php.net/manual/en/function.curl-multi-exec.php
An improved getStats() with a reused curl handle:
function getStats($curl, $username) {
    // urlencode() keeps names containing spaces (e.g. "Skiller 703") valid in the query string
    curl_setopt($curl, CURLOPT_URL, "http://hiscore.runescape.com/index_lite.ws?player=" . urlencode($username));
    $output = curl_exec($curl);
    if ($output === false || curl_getinfo($curl, CURLINFO_HTTP_CODE) != 200) {
        return null;
    }
    return $output;
}
Usage:
$participants = array("Zezima", "Allar", "Foot", "Arma150", "Green098", "Skiller 703", "Quuxx");
$curl = curl_init();
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0); // dangerous! will wait indefinitely
curl_setopt($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4, 5)));
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_VERBOSE, 1);
// try:
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
    'Connection: Keep-Alive',
    'Keep-Alive: 300'
));
header('Content-type:text/plain');
foreach ($participants as $user) {
    $stats = getStats($curl, $user);
    if ($stats !== null) {
        echo $stats . "\r\n";
    }
}
curl_close($curl);
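The parallel approach mentioned above can be sketched roughly as follows. This is an untested-against-the-live-site illustration, not part of the original answer: `statsUrl()` and `fetchAllStats()` are hypothetical names, the timeouts and the batch size of 10 are assumed values, and the batching is only there to cap how many simultaneous requests reach the remote server.

```php
<?php
// Hypothetical helper: build the lookup URL, encoding spaces in names such as "Skiller 703".
function statsUrl(string $username): string
{
    return 'http://hiscore.runescape.com/index_lite.ws?player=' . urlencode($username);
}

// Hypothetical helper: fetch several players concurrently, at most $batchSize at a time.
// Returns an array of username => response body, or null where the request failed.
function fetchAllStats(array $usernames, int $batchSize = 10): array
{
    $results = array();
    foreach (array_chunk($usernames, max(1, $batchSize)) as $batch) {
        $multi = curl_multi_init();
        $handles = array();
        foreach ($batch as $name) {
            $ch = curl_init(statsUrl($name));
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // assumed value, in seconds
            curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // assumed value, in seconds
            curl_multi_add_handle($multi, $ch);
            $handles[$name] = $ch;
        }
        // Drive all transfers in this batch until none are still running.
        do {
            $status = curl_multi_exec($multi, $running);
            if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
                usleep(1000); // select failed; back off briefly instead of spinning
            }
        } while ($running > 0 && $status === CURLM_OK);

        foreach ($handles as $name => $ch) {
            $ok = curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
            // curl_multi_getcontent() reads the body that was already downloaded.
            $results[$name] = $ok ? curl_multi_getcontent($ch) : null;
            curl_multi_remove_handle($multi, $ch);
        }
        // The handles are released when $handles and $multi are reassigned or go out of scope.
    }
    return $results;
}
```

A call such as `fetchAllStats($participants, 10)` would then request the names in groups of ten. Whether this beats a single reused handle depends on the remote server's latency and any rate limiting, so it is worth benchmarking both.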