How to download a full website?
Question
After fixing the code of a website to use a CDN (rewriting all the URLs to images, JS & CSS), I need to test all the pages on the domain to make sure all the resources are fetched from the CDN.
All the site's pages are accessible through links; there are no isolated pages.
Currently I'm using FireBug and checking the "Net" view...
Is there some automated way to give a domain name and request all pages + resources of the domain?
Update:

OK, I found I can use wget like so:
wget -p --no-cache -e robots=off -m -H -D cdn.domain.com,www.domain.com -o site1.log www.domain.com
Explanation of the options:

-p - download resources too (images, css, js, etc.)
--no-cache - get the real object, do not return the server-cached object
-e robots=off - disregard robots and no-follow directives
-m - mirror the site (follow links)
-H - span hosts (follow other domains too)
-D cdn.domain.com,www.domain.com - specify which domains to follow, otherwise every link on the page will be followed
-o site1.log - log to the file site1.log
-U "Mozilla/5.0" - optional: fake the user agent - useful if the server returns different data for different browsers
www.domain.com - the site to download
Enjoy!
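One way to check the result afterwards: with -m and multiple hosts, wget saves each host's files under a directory named after that host (www.domain.com/, cdn.domain.com/), so any image/CSS/JS file that lands under the origin's directory was fetched from the origin rather than the CDN. A minimal sketch, assuming the directory layout produced by the command above (the extension list is an assumption to adapt):

```python
import os

# Sketch: list resource files (by extension) that ended up under the
# origin host's mirror directory instead of the CDN's. The extension
# set and the default directory name are assumptions -- adjust them.
RESOURCE_EXTS = {".png", ".jpg", ".gif", ".css", ".js"}

def non_cdn_resources(mirror_root="www.domain.com"):
    """Return resource files found under the origin's mirror directory."""
    hits = []
    for dirpath, _dirs, files in os.walk(mirror_root):
        for name in files:
            if os.path.splitext(name)[1].lower() in RESOURCE_EXTS:
                hits.append(os.path.join(dirpath, name))
    return hits
```

An empty result means every image, stylesheet and script was saved under the CDN's directory, i.e. fetched from the CDN.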
Answer
The wget documentation has this bit in it:
Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:
wget -E -H -k -K -p http://site/document
The key is the -H option, which means --span-hosts -> go to foreign hosts when recursive. I don't know if this also applies to normal hyperlinks or only to resources, but you should try it out.
You can consider an alternate strategy. You don't need to download the resources to test that they are referenced from the CDN. You can just get the source code for the pages you're interested in (you can use wget, as you did, or curl, or something else) and either:
- parse it using a library - which one depends on the language you're using for scripting. Check each <img />, <link /> and <script /> for CDN links.
- use regexes to check that the resource URLs contain the CDN domain. See this :), although in this limited case it might not be overly complicated.
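The first option can be sketched with Python's standard-library html.parser; CDN_HOST and the choice of tags/attributes are assumptions to adapt to your site:

```python
from html.parser import HTMLParser

# Assumed CDN host name -- substitute your real CDN domain.
CDN_HOST = "cdn.domain.com"

class CdnChecker(HTMLParser):
    """Collect src/href URLs from <img>, <link> and <script> tags
    that do not point at the CDN."""
    def __init__(self):
        super().__init__()
        self.offenders = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "link", "script"):
            return
        for name, value in attrs:
            # Only absolute URLs are checked; relative ones resolve
            # against the page's own host.
            if name in ("src", "href") and value and "://" in value:
                if CDN_HOST not in value:
                    self.offenders.append((tag, value))

def find_non_cdn(html):
    """Return (tag, url) pairs whose resource URL is not on the CDN."""
    checker = CdnChecker()
    checker.feed(html)
    return checker.offenders
```

Feed it the source of each page you fetched; an empty list means every tag it looks at references the CDN.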
You should also check all CSS files for url() links - they should also point to CDN images. Depending on the logic of your application, you may need to check that the JavaScript code does not create any images that do not come from the CDN.
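The CSS check can be sketched with a simple regex over url(...) references (the CDN host is an assumption; relative URLs are skipped because they resolve against the CSS file's own location, which is already on the CDN):

```python
import re

# Assumed CDN host name -- substitute your real CDN domain.
CDN_HOST = "cdn.domain.com"

# Matches url(...), with or without quotes around the target.
URL_RE = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)

def non_cdn_css_urls(css_text):
    """Return absolute url() targets that do not point at the CDN."""
    return [u for u in URL_RE.findall(css_text)
            if "://" in u and CDN_HOST not in u]
```

Run it over each downloaded stylesheet; any URL it returns is an image (or font) still being served from outside the CDN.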