Puppeteer: how to download entire web page for offline use


Problem Description

How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), with Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to.

However, looking through the many excellent examples online, there is no obvious method for doing so. The closest I have been able to find is calling

const htmlContents = await page.content();

and saving the results, but that saves a copy without any non-HTML elements.
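For reference, a minimal sketch of that HTML-only approach (the URL and output file name here are just placeholders): it writes the serialized DOM to disk, but stylesheets, scripts, and images remain external references that still require a network connection.

'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://example.com');

    // page.content() returns only the serialized HTML of the current DOM;
    // CSS, JavaScript, and media are left behind as external references.
    const htmlContents = await page.content();
    fs.writeFileSync('page.html', htmlContents);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();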

Is there a way to save webpages for offline use with Puppeteer?

Recommended Answer

It is currently possible via the experimental CDP call 'Page.captureSnapshot' using the MHTML format:

'use strict';

const puppeteer = require('puppeteer');
const fs = require('fs');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    await page.goto('https://en.wikipedia.org/wiki/MHTML');

    // Open a raw Chrome DevTools Protocol session for this page.
    const cdp = await page.target().createCDPSession();

    // Page.captureSnapshot serializes the page and its resources
    // into a single MHTML document.
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
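The resulting page.mhtml file is a single MIME-encapsulated archive bundling the HTML together with resources such as stylesheets and images, and Chromium-based browsers can open it directly for offline viewing. Note that Page.captureSnapshot is marked experimental in the DevTools Protocol, so its behavior may change between Chrome versions.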

