Puppeteer: how to download an entire web page for offline use
Question
How would I scrape an entire website, with all of its CSS, JavaScript, and media intact (not just its HTML), using Google's Puppeteer? Having used it successfully on other scraping jobs, I would expect it to be capable of this.
However, after looking through the many excellent examples online, I cannot find an obvious way to do it. The closest I have been able to find is calling
html_contents = await page.content()
and saving the result, but that saves a copy without any of the non-HTML resources.
Is there a way to save web pages for offline use with Puppeteer?
Recommended answer
This is currently possible via the experimental Chrome DevTools Protocol (CDP) call 'Page.captureSnapshot', which saves the page in MHTML format:
'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  let browser;
  try {
    browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://en.wikipedia.org/wiki/MHTML');

    // Open a raw Chrome DevTools Protocol session for this page.
    const cdp = await page.target().createCDPSession();
    // Page.captureSnapshot returns the page snapshot as a single MHTML string.
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);
  } catch (err) {
    console.error(err);
  } finally {
    // Close the browser even if navigation or the snapshot failed.
    if (browser) await browser.close();
  }
})();
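If several pages need to be saved, the snapshot step from the answer could be wrapped in a small helper. The following is only a minimal sketch under stated assumptions: `mhtmlFileName` and `savePageAsMhtml` are hypothetical names that do not come from the answer or from Puppeteer, the file-naming scheme is an arbitrary choice, and `page` is assumed to be a Puppeteer page that has already finished navigating.

```javascript
'use strict';

const fs = require('fs');
const path = require('path');

// Hypothetical helper: derive a file-system-friendly name from a URL,
// keeping only letters, digits, dots, and hyphens.
function mhtmlFileName(url) {
  const { hostname, pathname } = new URL(url);
  const slug = (hostname + pathname)
    .replace(/[^a-zA-Z0-9.-]+/g, '_') // collapse every other run of characters to "_"
    .replace(/^_+|_+$/g, '');         // trim leading and trailing underscores
  return `${slug}.mhtml`;
}

// Hypothetical helper: snapshot an already-loaded page into outDir
// using the same experimental CDP call as the answer.
async function savePageAsMhtml(page, outDir) {
  const cdp = await page.target().createCDPSession();
  try {
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    const outPath = path.join(outDir, mhtmlFileName(page.url()));
    fs.mkdirSync(outDir, { recursive: true });
    fs.writeFileSync(outPath, data);
    return outPath;
  } finally {
    // Release the CDP session whether or not the snapshot succeeded.
    await cdp.detach();
  }
}
```

With these helpers, the script above would call `await savePageAsMhtml(page, 'snapshots')` after `page.goto(...)` resolves; the `snapshots` directory name is illustrative. Because 'Page.captureSnapshot' is marked experimental, its behavior may change between Chrome versions.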