Managing Puppeteer for memory and performance


Question

I'm using Puppeteer to scrape some pages, and I'm curious how to manage it in production in a Node app. I'll be scraping up to 500,000 pages a day, but the scrape jobs will arrive at random intervals, so it isn't a single queue that I can plow through.

What I'm wondering is whether it's better to open a browser, go to the page, and then close the browser for each job. I assume that would be a lot slower, but maybe it would handle memory better?

Or should I open one global browser when the app boots, go to the page, and have some way to dump that page when I'm done with it (e.g. closing all tabs in Chrome, but not closing Chrome itself), then open a new page whenever I need one? This seems like it would be faster, but it could potentially eat up a lot of memory.

I've never worked with this library, especially in a production environment, so I'm not sure whether there are things I should watch out for.

Recommended answer

If you are scraping 500,000 pages per day (on average, one page every 0.1728 seconds), I would recommend opening a new page in an existing browser session rather than launching a new browser session for each page.
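The 0.1728-second figure is simply 86,400 seconds per day divided by 500,000 pages. As a rough sketch, the same arithmetic can estimate how many pages would be open at once in the shared browser. The helper names and the 2-second average scrape time below are assumptions for illustration, not measurements, and because the jobs arrive at random intervals, real peaks will be higher than this average.

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Average gap between page requests if the load were spread evenly.
function secondsPerPage(pagesPerDay) {
  return SECONDS_PER_DAY / pagesPerDay;
}

// Average number of pages in flight = arrival rate (pages/s) x time per scrape (s).
// `avgSecondsPerScrape` is an assumed value you would measure yourself.
function pagesInFlight(pagesPerDay, avgSecondsPerScrape) {
  return Math.ceil((pagesPerDay / SECONDS_PER_DAY) * avgSecondsPerScrape);
}

console.log(secondsPerPage(500000));   // 0.1728
console.log(pagesInFlight(500000, 2)); // 12, assuming 2 s per scrape
```

At that rate, launching a whole browser for every job would mean several launches per second, which is the main reason to prefer reusing one session.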

You can open and close a Page using the following methods:

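const page = await browser.newPage(); // open a new tab in the existing browser
// ... navigate and scrape here ...
await page.close(); // close the tab when the job is done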
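One way to apply this per job is to wrap the two calls in try/finally so the page is closed even when navigation fails; otherwise failed jobs can leave tabs open and memory usage creeps up. This is a minimal sketch: `scrapeTitle` is a hypothetical helper name, `browser` is assumed to be the object returned by `puppeteer.launch()`, and reading the page title stands in for whatever the real job extracts.

```javascript
// One job = one short-lived page inside a long-lived browser.
// `browser` is assumed to come from puppeteer.launch(), done once at app start.
async function scrapeTitle(browser, url) {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.title();
  } finally {
    // Runs on success and on error, so tabs do not pile up.
    await page.close();
  }
}

// Usage (requires the puppeteer package):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   console.log(await scrapeTitle(browser, 'https://example.com'));
```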

If you decide to use one Browser for your project, I would make sure to implement error handling so that, if the program crashes, you have minimal downtime while you create a new Page, Browser, or BrowserContext.
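As a hedged sketch of what that error handling might look like, the helper below keeps one shared browser and launches a fresh one the next time it is needed after Puppeteer emits its `disconnected` event (which fires when Chrome crashes or is closed). `createBrowserManager`, `getBrowser`, `withPage`, and the `launch` argument are hypothetical names; in a real app `launch` would be something like `() => require('puppeteer').launch()`.

```javascript
// Keeps one shared browser and replaces it after a crash or disconnect.
// `launch` is an assumed factory, e.g. () => require('puppeteer').launch().
function createBrowserManager(launch) {
  let browserPromise = null;

  function getBrowser() {
    if (!browserPromise) {
      const attempt = launch().then((browser) => {
        // Forget this browser once it disconnects so the next call relaunches.
        browser.once('disconnected', () => {
          if (browserPromise === attempt) browserPromise = null;
        });
        return browser;
      });
      // Do not cache a failed launch; let the next call try again.
      attempt.catch(() => {
        if (browserPromise === attempt) browserPromise = null;
      });
      browserPromise = attempt;
    }
    return browserPromise;
  }

  // Runs `job(page)` on a fresh page and always tries to close the page.
  async function withPage(job) {
    const browser = await getBrowser();
    const page = await browser.newPage();
    try {
      return await job(page);
    } finally {
      // The page may already be gone if the browser crashed mid-job.
      await page.close().catch(() => {});
    }
  }

  return { getBrowser, withPage };
}
```

A job that fails because the browser died still rejects here; whether to retry it is a separate decision that depends on the workload.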
