Best way to go about reading a website?

Question

I'm trying to create a program that grabs data from a website x amount of times and I'm looking for a way to go about doing so without huge delays in the process.

Currently I use the following code, and it's rather slow (even though it is only grabbing 4 people's names, I'm expecting to do about 100 at a time):

$skills = array(
    "overall", "attack", "defense", "strength", "constitution", "ranged",
    "prayer", "magic", "cooking", "woodcutting", "fletching", "fishing",
    "firemaking", "crafting", "smithing", "mining", "herblore", "agility",
    "thieving", "slayer", "farming", "runecrafting", "hunter", "construction",
    "summoning", "dungeoneering"
);

$participants = array("Zezima", "Allar", "Foot", "Arma150", "Green098", "Skiller 703", "Quuxx");//explode("\r\n", $_POST['names']);

$skill = isset($_GET['skill']) ? array_search($_GET['skill'], $skills) : 0; // array_search() takes the needle first, then the haystack

display($participants, $skills, $skill);

function getAllStats($participants) {
    $stats = array();
    for ($i = 0; $i < count($participants); $i++) {
        $stats[] = getStats($participants[$i]);
    }
    return $stats;
}

function display($participants, $skills, $stat) {
    $all = getAllStats($participants);
    for ($i = 0; $i < count($participants); $i++) {
        $rank = getSkillData($all[$i], 0, $stat);
        $level = getSkillData($all[$i], 1, $stat);
        $experience = getSkillData($all[$i], 2, $stat); // each line is "rank,level,experience", so index 2
    }
}

function getStats($username) {
    $timeout = 5;    // connection timeout in seconds
    $header = false; // don't include response headers in the output
    $curl = curl_init("http://hiscore.runescape.com/index_lite.ws?player=" . $username);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4, 5)));
    curl_setopt($curl, CURLOPT_HEADER, (int) $header);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_VERBOSE, 1);
    $output = curl_exec($curl);
    curl_close($curl);
    // an HTML response means the lookup failed (the normal response is plain text)
    if (strstr($output, "<html><head><title>")) {
        return false;
    }
    return $output;
}

function getSkillData($stats, $row, $skill) {
    $stats = explode("\n", $stats);
    $levels = explode(",", $stats[$skill]);
    return $levels[$row];
}

When I benchmarked this it took about 5 seconds, which isn't too bad, but imagine if I was doing this 93 more times. I understand it won't be instant, but I'd like to shoot for under 30 seconds. I know it's possible because I've seen websites which do something similar and they act within a 30 second time period.

I've read about caching the data, but that won't work because, simply, it will be old. I'm using a database (further on, I haven't gotten to that part yet) to store old data and retrieve new data, which will be real time (what you see below).

Is there a way to achieve doing something like this without massive delays (and possibly overloading the server I am reading from)?

P.S.: The website I am reading from is just text; it doesn't have any HTML to parse, which should reduce the loading time. Here's an example of what a page looks like (they're all the same, just different numbers):
69,2496,1285458634 10982,99,33055154 6608,99,30955066 6978,99,40342518 12092,99,36496288 13247,99,21606979 2812,99,13977759 926,99,36988378 415,99,153324269 329,99,59553081 472,99,40595060 2703,99,28297122 281,99,36937100 1017,99,19418910 276,99,27539259 792,99,34289312 3040,99,16675156 82,99,39712827 80,99,104504543 2386,99,21236188 655,99,28714439 852,99,30069730 29,99,200000000 3366,99,15332729 2216,99,15836767 154,120,200000000 -1,-1 -1,-1 -1,-1 -1,-1 -1,-1 30086,2183 54640,1225 89164,1028 123432,1455 -1,-1 -1,-1
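
Each entry appears to follow the order of the $skills array, with the fields rank,level,experience; the sample is shown on one line here, but getSkillData() above splits the raw response on newlines. A minimal parsing sketch (the sample values are just the first two rows from above):

// Sketch: pull apart one skill row from a raw hiscore response.
// Assumes rows come in the same order as $skills, each as "rank,level,experience".
$output = "69,2496,1285458634\n10982,99,33055154"; // first two rows of the sample above
$lines = explode("\n", $output);
list($rank, $level, $xp) = explode(",", $lines[0]); // row 0 = "overall"
echo "overall: rank $rank, level $level, xp $xp\n"; // overall: rank 69, level 2496, xp 1285458634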

My previous benchmark with this method vs. curl_multi_exec:

function getTime() {
    // current Unix time as a float (seconds + microseconds), i.e. microtime(true)
    $timer = explode(' ', microtime());
    $timer = $timer[1] + $timer[0];
    return $timer;
}

function benchmarkFunctions() {
    $start = getTime();
    old_f();
    $end = getTime();
    echo 'function old_f() took ' . round($end - $start, 4) . ' seconds to complete<br><br>';
    $startt = getTime();
    new_f();
    $endd = getTime();
    echo 'function new_f() took ' . round($endd - $startt, 4) . ' seconds to complete';
}

function old_f() {
    $test = array("A E T", "Ts Danne", "Funkymunky11", "Fast993", "Fast99Three", "Jeba", "Quuxx");
    getAllStats($test);
}

function new_f() {
    $test = array("A E T", "Ts Danne", "Funkymunky11", "Fast993", "Fast99Three", "Jeba", "Quuxx");
    $curl_arr = array();
    $master = curl_multi_init();

    $amt = count($test);
    for ($i = 0; $i < $amt; $i++) {
        $curl_arr[$i] = curl_init('http://hiscore.runescape.com/index_lite.ws?player=' . $test[$i]);
        curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($master, $curl_arr[$i]);
    }

    do {
        curl_multi_exec($master, $running);
    } while ($running > 0);

    for ($i = 0; $i < $amt; $i++) {
        // curl_exec() is for single handles; results of a multi transfer
        // are read with curl_multi_getcontent()
        $results[$i] = curl_multi_getcontent($curl_arr[$i]);
    }
}


Answer

You can reuse curl connections. Also, I changed your code to check the httpCode instead of using strstr. Should be quicker.

Also, you can set up curl to do it in parallel, which I've never tried. See http://www.php.net/manual/en/function.curl-multi-exec.php
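
As a rough illustration of that parallel approach (a sketch on my part, not code from the original answer; it assumes a $names array of player names and reuses the same lite hiscore URL), the main differences from new_f() above are that it waits on curl_multi_select() instead of spinning, reads each body with curl_multi_getcontent(), and cleans up the handles afterwards:

// Sketch only: fetch several players in parallel with curl_multi.
function fetchAll(array $names) {
    $multi = curl_multi_init();
    $handles = array();
    foreach ($names as $name) {
        $ch = curl_init('http://hiscore.runescape.com/index_lite.ws?player=' . urlencode($name));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
        curl_multi_add_handle($multi, $ch);
        $handles[$name] = $ch;
    }

    // Drive the transfers; curl_multi_select() sleeps until a handle has
    // activity, so the loop doesn't burn CPU while waiting.
    do {
        curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi);
        }
    } while ($running > 0);

    $results = array();
    foreach ($handles as $name => $ch) {
        $results[$name] = curl_multi_getcontent($ch); // raw response body
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $results;
}

Calling $data = fetchAll($participants); would then return one response per player, keyed by name, ready to be fed through getSkillData().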

An improved getStats() with a reused curl handle:

function getStats(&$curl, $username) {
    curl_setopt($curl, CURLOPT_URL, "http://hiscore.runescape.com/index_lite.ws?player=" . $username);
    $output = curl_exec($curl);
    if (curl_getinfo($curl, CURLINFO_HTTP_CODE) != 200) {
        return null; // unknown player or other error
    }
    return $output;
}

Usage:

$participants = array("Zezima", "Allar", "Foot", "Arma150", "Green098", "Skiller 703", "Quuxx");

$curl = curl_init();
curl_setopt ($curl, CURLOPT_CONNECTTIMEOUT, 0); //dangerous! will wait indefinitely
curl_setopt ($curl, CURLOPT_USERAGENT, sprintf("Mozilla/%d.0", rand(4, 5)));
curl_setopt ($curl, CURLOPT_HEADER, false);
curl_setopt ($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt ($curl, CURLOPT_VERBOSE, 1);
//try:
curl_setopt($curl, CURLOPT_HTTPHEADER, array(
    'Connection: Keep-Alive',
    'Keep-Alive: 300'
));


header('Content-type:text/plain');
foreach ($participants as $user) {
    $stats =  getStats($curl, $user);
    if($stats!==null) {
        echo $stats."\r\n";
    }
}

curl_close($curl);
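
For context on why this helps: with the Keep-Alive headers set, the reused handle lets cURL keep the underlying TCP connection open between lookups, so each subsequent request skips the connection setup that a fresh curl_init()/curl_close() pair pays every time. Combined with the parallel approach sketched above, that should make a run of ~100 names far more likely to fit inside the 30-second target, though the final numbers depend on the remote server.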
