Puppeteer parallel scraping via multiple pages
Problem description
I wanted to scrape multiple URLs simultaneously, so I used p-queue to implement a promise queue.
For example, the code below uses one browser and multiple pages to do the job:
const pptr = require('puppeteer');
const PQueue = require('p-queue');

const queue = new PQueue({
  concurrency: 5
});

(async () => {
  let instance = await pptr.launch({
    headless: false,
  });

  // task processor function
  const createInstance = async (url) => {
    let page = await instance.newPage();
    await page.goto(url); // was `page.goto(email)` in the original, a typo
    // (PROBLEM) more operations go here
    // ...
    return await page.close();
  };

  // add tasks to queue
  for (let url of urls) {
    queue.add(async () => createInstance(url));
  }
})();
The problem is that, while multiple URLs are indeed opened at the same time via multiple pages, it looks like only the one page currently focused by the browser keeps executing its operations (see the more operations go here section in the code above); the other pages (tabs) simply stop working unless I click on one to give it focus.
So is there any workaround to run all the pages simultaneously?
Answer
I found out why the above code didn't work: I shouldn't await instance outside of the worker function, but await it inside. See below:
(async () => {
  let instance = pptr.launch({ // don't await here
    headless: false,
  });

  // task processor function
  const createInstance = async (url) => {
    let real_instance = await instance; // await here
    let page = await real_instance.newPage();
    await page.goto(url); // was `page.goto(email)` in the original, a typo
    // more operations go here
    // ...
    return await page.close();
  };

  // add tasks to queue
  for (let url of urls) {
    queue.add(async () => createInstance(url));
  }
})();
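The key point in the answer is that all workers share a single launch promise: pptr.launch() is called exactly once, and each worker awaits the same promise, so a slow launch never blocks queueing. A minimal sketch of that pattern without Puppeteer (the fakeLaunch helper and the sample values are hypothetical, for illustration only):

```javascript
// Sketch of the shared-promise pattern: the "launch" promise is created
// once, outside the worker, and every worker awaits the same promise.
let launchCount = 0;
const fakeLaunch = () =>
  new Promise((resolve) => {
    launchCount++; // counts how many times "launch" actually runs
    setTimeout(() => resolve({ newPage: async () => ({}) }), 10);
  });

const instance = fakeLaunch(); // not awaited here

const worker = async (url) => {
  const browser = await instance; // each worker awaits the same promise
  await browser.newPage();
  return url;
};

(async () => {
  const results = await Promise.all(['a', 'b', 'c'].map(worker));
  console.log(launchCount, results.join(',')); // prints: 1 a,b,c
})();
```

Despite three concurrent workers, launchCount stays at 1, because awaiting an already-created promise multiple times never re-runs its executor.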