How to download a full website?


Problem Description

After fixing the code of a website to use a CDN (rewriting all the URLs to images, JS & CSS), I need to test all the pages on the domain to make sure all the resources are fetched from the CDN.

All the site's pages are reachable through links; there are no isolated pages.

Currently I'm using Firebug and checking the "Net" view...

Is there some automated way to give a domain name and request all the pages and resources of that domain?

Update:

OK, I found I can use wget like so:

wget -p --no-cache -e robots=off -m -H -D cdn.domain.com,www.domain.com -o site1.log www.domain.com

Explanation of the options:

  • -p - download page requisites too (images, CSS, JS, etc.)
  • --no-cache - fetch the real object, do not accept a server-cached copy
  • -e robots=off - disregard robots.txt and no-follow directives
  • -m - mirror the site (follow links recursively)
  • -H - span hosts (follow links to other domains too)
  • -D cdn.domain.com,www.domain.com - specify which domains to follow; otherwise every link on the page would be followed
  • -o site1.log - log to the file site1.log (used for the quick check shown after this list)
  • -U "Mozilla/5.0" - optional: fake the user agent; useful if the server returns different data to different browsers
  • www.domain.com - the site to download
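As a quick sanity check (not part of the original question, just a sketch), the log written by -o can be scanned for static assets that were fetched from somewhere other than the CDN. This assumes the asset URLs end in a plain file extension (no query strings) and that cdn.domain.com is the only host assets should come from:

    # extract every image/CSS/JS URL wget requested from site1.log,
    # drop the ones served by the CDN, and print whatever is left
    grep -oE 'http://[^ ]+\.(png|jpe?g|gif|css|js)' site1.log \
        | grep -v 'cdn\.domain\.com' \
        | sort -u
    # an empty result means every static asset came from the CDN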

Enjoy!

Recommended Answer

The wget documentation has this bit in it:

Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:

      wget -E -H -k -K -p http://site/document

The key is the -H option, which means --span-hosts -> go to foreign hosts when recursive. I don't know if this also applies to normal hyperlinks or only to page requisites, but you should try it out.
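If you go this route for the CDN test, the documentation's options can be combined with the -D restriction from the update above, so the recursion spans to the CDN host but not to every foreign site that happens to be linked. An untested sketch, reusing the same domain names (site2.log is just a placeholder log name):

    # mirror the site, fetch page requisites, and only span to the two listed hosts
    wget -m -p -E -H -k -K -D cdn.domain.com,www.domain.com -o site2.log http://www.domain.com/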

You could consider an alternate strategy. You don't need to download the resources to test that they are referenced from the CDN. You can just get the source code of the pages you're interested in (using wget, as you did, or curl, or something else) and either:

  • parse it with a library - which one depends on the language you use for your script - and check each resource reference (image src, script src, stylesheet href) to verify that it points to the CDN (a rough shell sketch follows below)
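A lighter-weight variant of that idea, sketched with curl and grep instead of a real HTML parser, and reusing the cdn.domain.com host from above: pull the src/href attributes of static assets out of a page and list any that do not point at the CDN.

    # list asset references on one page that do not point at the CDN
    # (a quick regex check; an HTML parser library would be more robust)
    curl -s http://www.domain.com/ \
        | grep -oE '(src|href)="[^"]+\.(png|jpe?g|gif|css|js)[^"]*"' \
        | grep -v 'cdn\.domain\.com'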