Node.js: Async requests with a list of URLs

Question

I am working on a crawler and have a list of URLs that need to be requested. As written, several hundred requests are fired at the same time. I am afraid this will saturate my bandwidth or put too much load on the target website. What should I do?

This is what I am doing:

urlList.forEach((url, index) => {

    console.log('Fetching ' + url);
    request(url, function(error, response, body) {
        //do sth for body

    });
});

I want each request to be sent only after the previous one has completed.

Recommended answer

The things you need to watch out for are:

  1. Whether the target site has rate limiting, in which case you may be blocked if you request too much too fast.

  2. How many simultaneous requests the target site can handle without degrading its performance.

  3. How much bandwidth your server has on its end of things.

  4. How many simultaneous requests your own server can have in flight and process without causing excess memory usage or a pegged CPU.

In general, the scheme for managing all this is to create a way to tune how many requests you launch. You can control this in several ways: by the number of simultaneous requests, the number of requests per second, the amount of data used, and so on.

The simplest way to start is to control only how many simultaneous requests you make. That can be done like this:

function runRequests(arrayOfData, maxInFlight, fn) {
    return new Promise((resolve, reject) => {
        let index = 0;
        let inFlight = 0;

        function next() {
            while (inFlight < maxInFlight && index < arrayOfData.length) {
                ++inFlight;
                fn(arrayOfData[index++]).then(result => {
                    --inFlight;
                    next();
                }).catch(err => {
                    --inFlight;
                    console.log(err);
                    // purposely eat the error and let the rest of the processing continue
                    // if you want to stop further processing, you can call reject() here
                    next();
                });
            }
            if (inFlight === 0) {
                // all done
                resolve();
            }
        }
        next();
    });
}

Then you would use it like this:

const rp = require('request-promise');

// run the whole urlList, no more than 10 at a time
runRequests(urlList, 10, function(url) {
    return rp(url).then(function(data) {
        // process fetched data here for one url
    }).catch(function(err) {
        console.log(url, err);
    });
}).then(function() {
    // all requests done here
});
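
The function above discards whatever each request resolves to. If you also want the results back in input order, one possible variation is sketched below. The name mapWithLimit is hypothetical and not part of the original answer; the sketch assumes fn returns a promise and, like the original, it logs errors and keeps going (a failed item yields undefined).

```javascript
// Hypothetical variant of runRequests that also collects results.
// Runs fn(item) for every item with at most maxInFlight calls active
// at once, and resolves to an array of results in input order.
async function mapWithLimit(items, maxInFlight, fn) {
    const results = new Array(items.length);
    let index = 0;

    // Each worker repeatedly claims the next unprocessed index until none remain.
    async function worker() {
        while (index < items.length) {
            const i = index++;
            try {
                results[i] = await fn(items[i]);
            } catch (err) {
                // Purposely eat the error so the remaining items are still processed.
                console.log(err);
                results[i] = undefined;
            }
        }
    }

    // Start no more workers than there are items.
    const workerCount = Math.min(maxInFlight, items.length);
    await Promise.all(Array.from({ length: workerCount }, worker));
    return results;
}

// Usage, assuming rp and urlList from the example above:
// mapWithLimit(urlList, 10, url => rp(url)).then(bodies => { /* ... */ });
```

Each worker is just a loop, so the concurrency cap is simply the number of loops running.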

This can be made as sophisticated as you want by adding a time element (no more than N requests per second) or even a bandwidth element.
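
As a rough illustration of the time element, the sketch below spaces out the start of each request so that roughly no more than perSecond of them begin in any one second. The name runRateLimited and its parameters are hypothetical and not from the original answer. It limits only the start rate, not the number of requests in flight, and because timers can fire late the spacing is approximate.

```javascript
// Hypothetical helper: starts fn(item) for each item, with starts spaced
// 1000 / perSecond milliseconds apart. Resolves to the results in input
// order; a failed item is logged and yields undefined.
function runRateLimited(items, perSecond, fn) {
    const intervalMs = 1000 / perSecond;
    const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

    const pending = items.map((item, i) =>
        delay(i * intervalMs)
            .then(() => fn(item))
            .catch(err => {
                // Eat the error so one failure does not reject the whole batch.
                console.log(err);
                return undefined;
            })
    );
    return Promise.all(pending);
}

// Usage, assuming rp and urlList from the example above:
// runRateLimited(urlList, 5, url => rp(url)).then(bodies => { /* ... */ });
```

If responses are slow, requests can still pile up under this scheme, so in practice you might combine it with a cap on simultaneous requests.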

You wrote: "I want one request is called after one request is completed."

That's a very slow way to do things. If you really want that, you can pass 1 for the maxInFlight parameter to the function above. Typically, though, things work a lot faster and cause no problems if you allow somewhere between 5 and 50 simultaneous requests. Only testing will tell you where the sweet spot is for your particular target sites, your particular server infrastructure, and the amount of processing you need to do on the results.
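
If strictly one-at-a-time processing really is what you want, it can also be written directly with async/await. The sketch below is only an illustration: runSequential is a hypothetical name, fn is assumed to return a promise, and errors are logged and skipped as in the function above.

```javascript
// Hypothetical helper: calls fn(item) for each item, waiting for each
// call to settle before starting the next. Resolves to the results in
// input order; a failed item is logged and yields undefined.
async function runSequential(items, fn) {
    const results = [];
    for (const item of items) {
        try {
            results.push(await fn(item));
        } catch (err) {
            console.log(err);
            results.push(undefined);
        }
    }
    return results;
}

// Usage, assuming rp and urlList from the example above:
// runSequential(urlList, url => rp(url)).then(bodies => { /* ... */ });
```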
