NodeJS HTTP Request Queue


Question

I've created a scraper using Puppeteer and Node.js (Express). The idea is that when the server receives an HTTP request, my app starts scraping the page.

The problem is that if my app receives multiple HTTP requests at once, a scraping process starts for every one of them and they all run at the same time. How do I handle only one request at a time and queue the others until the first scraping process finishes?

Currently, I've tried node-request-queue with the code below, but no luck.

var express = require("express");
var app = express();
var reload = require("express-reload");
var bodyParser = require("body-parser");
const router = require("./routes");
const RequestQueue = require("node-request-queue");

app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json());

var port = process.env.PORT || 8080;

app.use(express.static("public")); // static assets eg css, images, js

let rq = new RequestQueue(1);

rq.on("resolved", res => {})
  .on("rejected", err => {})
  .on("completed", () => {});

rq.push(app.use("/wa", router));

app.listen(port);
console.log("Magic happens on port " + port);

Solution

node-request-queue was created for the request package, which is a different thing from express.

You can accomplish the queue with the simplest promise-queue library, p-queue. It has concurrency support and reads far better than most other libraries. You can easily switch from plain promises to a robust queue such as bull at a later time.

This is how you can create a queue:

// p-queue v5-style require; on v6 use require("p-queue").default,
// and v7+ is ESM-only (import PQueue from "p-queue")
const PQueue = require("p-queue");
const queue = new PQueue({ concurrency: 1 });

This is how you add an async function to the queue; it returns the resolved data if you await or listen to it:

queue.add(() => scrape(url));
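The core idea behind a concurrency-1 promise queue can be sketched without any library. The `TinyQueue` below is a toy for illustration only, not a replacement for p-queue: each task is chained after the previous one, so tasks run strictly one at a time even when they are added all at once.

```javascript
// Minimal illustration of what a concurrency-1 promise queue does:
// each task waits for the previous one to settle before starting.
class TinyQueue {
  constructor() {
    this.tail = Promise.resolve();
  }
  add(task) {
    // Chain the task after whatever is already queued, and return
    // a promise that resolves with this task's own result.
    const result = this.tail.then(() => task());
    // Keep the chain alive even if a task rejects.
    this.tail = result.catch(() => {});
    return result;
  }
}

// Demo: tasks run one at a time, in submission order,
// even though the first task is the slowest.
const order = [];
const q = new TinyQueue();
const delay = ms => new Promise(r => setTimeout(r, ms));

async function main() {
  const results = await Promise.all([
    q.add(async () => { await delay(30); order.push("a"); return "a"; }),
    q.add(async () => { await delay(10); order.push("b"); return "b"; }),
    q.add(async () => { order.push("c"); return "c"; }),
  ]);
  console.log(order.join(","));   // a,b,c — serialized despite the timings
  console.log(results.join(",")); // a,b,c
}

main();
```

With `concurrency: 1`, p-queue behaves the same way for the scrape calls: every request still gets its own resolved result, but only one scrape runs at a time.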

So instead of adding the route itself to a queue, just remove the other lines around it and keep the router as is:

// here goes one route
app.use('/wa', router);

Inside one of your router files:

const routes = require("express").Router();

const PQueue = require("p-queue");
// create a new queue, and pass how many you want to scrape at once
const queue = new PQueue({ concurrency: 1 });

// our scraper function lives outside route to keep things clean
// the dummy function returns the title of provided url
const scrape = require('../scraper');

async function queueScraper(url) {
  return queue.add(() => scrape(url));
}

routes.post("/", async (req, res) => {
  const result = await queueScraper(req.body.url);
  res.status(200).json(result);
});

module.exports = routes;

Make sure to include the queue inside the route, not the other way around. Create only one queue, in your routes file or wherever you run the scraper.

Here are the contents of the scraper file. You can put anything you want here; this is just a working dummy:

const puppeteer = require('puppeteer');

// a dummy scraper function
// launches a browser and gets title
async function scrape(url){
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    // always close the browser, even when navigation throws
    await browser.close();
  }
}

module.exports = scrape;

Result using curl:

Here is my git repo, which has working code with a sample queue.

Warning

If you use any such queue, you will notice problems when dealing with 100 results at the same time: requests to your API will keep timing out, because 99 other URLs are waiting ahead of them in the queue. That is why you have to learn more about real queues and concurrency at some point.

Once you understand how queues work, the other answers about puppeteer-cluster, RabbitMQ, bull queues, etc., will help you then :) .
