Puppeteer: how to download entire web page for offline use


Problem Description

How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), with Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to.

However, looking through the many excellent examples online, there is no obvious method for doing so. The closest I have been able to find is calling

const htmlContents = await page.content();

and saving the results, but that saves a copy without any non-HTML elements.
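For reference, a minimal sketch of that HTML-only approach (the URL and output file name here are just placeholders): it writes the serialized DOM to disk, but stylesheets, scripts, and images remain external references that still require a network connection.

'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://example.com');

    // page.content() returns only the serialized HTML of the current DOM;
    // CSS, JavaScript, and media are left behind as external references.
    const htmlContents = await page.content();
    fs.writeFileSync('page.html', htmlContents);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();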

Is there a way to save webpages for offline use with Puppeteer?

Recommended Answer

It is currently possible via the experimental CDP call 'Page.captureSnapshot' using the MHTML format:

'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://en.wikipedia.org/wiki/MHTML');

    // Open a raw Chrome DevTools Protocol session for this page.
    const cdp = await page.target().createCDPSession();

    // Page.captureSnapshot serializes the page and its resources
    // into a single MHTML document.
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
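The resulting page.mhtml file is a single MIME-encapsulated archive bundling the HTML together with resources such as stylesheets and images, and Chromium-based browsers can open it directly for offline viewing. Note that Page.captureSnapshot is marked experimental in the DevTools Protocol, so its behavior may change between Chrome versions.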

