Puppeteer parallel scraping via multiple pages

Question

I wanted to scrape multiple URLs simultaneously, so I used p-queue to implement a promise queue.

For example, the code below uses one browser and multiple pages to do this job.

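// Assumed imports (the original snippet omits them)
import pptr from 'puppeteer';
import PQueue from 'p-queue';

// `urls` is assumed to be an array of URL strings defined elsewhere

// run at most 5 tasks at the same time
const queue = new PQueue({
    concurrency: 5
});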

(
    async () => {
        let instance = await pptr.launch({
            headless: false,
        });

        // task processor function
        const createInstance = async (url) => {
            let page = await instance.newPage();
            await page.goto(url);

            // (PROBLEM) more operations go here
            // ...

            return await page.close();
        }

        // add tasks to queue
        for (let url of urls) {
            queue.add(async () => createInstance(url))
        } 
    }
)()
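
For readers unfamiliar with promise queues, the effect of `concurrency: 5` can be illustrated without any dependencies. The sketch below is only an illustration and is not how p-queue is actually implemented; `TinyQueue`, `sleep`, and the fake tasks are hypothetical names introduced for this example. It runs six placeholder tasks with at most two in flight at once.

```javascript
// Minimal concurrency-limited promise queue (illustration only, NOT p-queue's implementation).
class TinyQueue {
    constructor({ concurrency }) {
        this.concurrency = concurrency;
        this.running = 0;   // tasks currently executing
        this.pending = [];  // tasks waiting for a free slot
    }

    // Schedule an async task; the returned promise settles with the task's own result.
    add(task) {
        return new Promise((resolve, reject) => {
            this.pending.push({ task, resolve, reject });
            this._next();
        });
    }

    // Start waiting tasks while free slots remain.
    _next() {
        while (this.running < this.concurrency && this.pending.length > 0) {
            const { task, resolve, reject } = this.pending.shift();
            this.running++;
            Promise.resolve()
                .then(task)
                .then(resolve, reject)
                .finally(() => {
                    this.running--;
                    this._next();
                });
        }
    }
}

// Usage: six fake "scrape" tasks, at most two running at the same time.
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
let active = 0;
let maxActive = 0;
const tinyQueue = new TinyQueue({ concurrency: 2 });
const results = Promise.all(
    [1, 2, 3, 4, 5, 6].map((n) =>
        tinyQueue.add(async () => {
            active++;
            maxActive = Math.max(maxActive, active);
            await sleep(10); // stands in for page.goto() and the scraping steps
            active--;
            return n * 2;
        })
    )
);
```

Here `results` is a promise for the task results in submission order, and `maxActive` records the largest number of tasks that were running simultaneously.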

The problem is that multiple URLs can indeed be opened at the same time via multiple pages, but it looks like only the one page currently focused by the browser keeps performing its operations (see the "more operations go here" section in the code above). The other pages (or tabs) stop working unless I click on them to give them focus.

So is there any workaround to run all the pages simultaneously?

Recommended answer

I found out why the code above didn't work: I shouldn't await the browser instance outside the worker function, but inside it, as shown below.

(
    async () => {
        let instance = pptr.launch({  // don't await here
            headless: false,
        });

        // task processor function
        const createInstance = async (url) => {
            let real_instance = await instance;  // await here
            let page = await real_instance.newPage();
            await page.goto(url);

            // (PROBLEM) more operations go here
            // ...

            return await page.close();
        }

        // add tasks to queue
        for (let url of urls) {
            queue.add(async () => createInstance(url))
        } 
    }
)()
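
The pattern in the answer relies on a general property of promises: a single pending promise can be awaited by many callers, and all of them receive the same resolved value. The sketch below illustrates only that property, without Puppeteer; `fakeLaunch`, `browserPromise`, and `worker` are hypothetical names, and `fakeLaunch` is a stand-in for `pptr.launch()`, not its real behavior. It does not reproduce the tab-focus issue itself.

```javascript
// Hypothetical stand-in for pptr.launch(): resolves to a fake "browser" object after a delay.
let launchCount = 0;
const fakeLaunch = () =>
    new Promise((resolve) => {
        launchCount++;
        setTimeout(() => resolve({ name: 'fake-browser' }), 20);
    });

// Start the launch once and keep the pending promise (no await here).
const browserPromise = fakeLaunch();

// Each worker awaits the same promise, so every worker gets the same browser object.
const worker = async (url) => {
    const browser = await browserPromise;
    return { url, browser };
};

// Two workers started concurrently, mirroring two queued tasks.
const done = Promise.all(
    ['https://example.com/a', 'https://example.com/b'].map(worker)
);
```

In this sketch the launch function runs exactly once, no matter how many workers await `browserPromise`, and each worker sees the identical object.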
