Save full webpage

Question

I ran into a problem while working on a project. I want to crawl certain websites of interest and save each one as a "full web page", including styles and images, in order to build a mirror of them. Several times I have bookmarked a website to read later, only to find a few days later that the site was down because it had been hacked and the owner had no backup of the database.

Of course, I can read the files very easily in PHP with fopen("http://website.com", "r") or fsockopen(), but the main goal is to save the full web pages so that, if a site goes down, it can still be available to others, like a "programming time machine" :)

Is there a way to do this without reading and saving each and every link on the page?

Objective-C solutions are also welcome, since I'm trying to learn more about it as well.

Thanks!

Recommended answer

You actually need to parse the HTML and every CSS file it references, which is NOT easy. A faster way is to use an external tool such as wget. After installing wget, you can run the following from the command line:

wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://example.com/mypage.html

This downloads mypage.html along with all linked CSS files, images, and any images referenced inside the CSS. Once wget is installed on your system, you can use PHP's system() function to control wget programmatically.
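
To make the command easier to reuse from a script (or from whatever a system() call ends up invoking), it can be wrapped in a small shell function. This is only a sketch: the function name mirror_page, the ./mirror default directory, and the extra --directory-prefix option are illustrative additions and not part of the original answer. The other flags are the ones from the command above.

```shell
#!/bin/sh
# Sketch only: mirror_page and the ./mirror default are hypothetical names.
#
# Flags taken from the answer:
#   --no-parent            never ascend above the starting directory
#   --timestamping         skip files that are not newer than the local copy
#   --convert-links        rewrite links so the saved page works offline
#   --page-requisites      fetch the CSS, images, etc. the page needs
#   --no-directories       do not recreate the remote directory tree
#   --no-host-directories  do not create a directory named after the host
#   -erobots=off           ignore robots.txt
# Added here as an assumption:
#   --directory-prefix     save everything under the given directory

# Usage: mirror_page URL [DEST_DIR]
mirror_page() {
    url="$1"
    dest="${2:-./mirror}"

    wget \
        --no-parent \
        --timestamping \
        --convert-links \
        --page-requisites \
        --no-directories \
        --no-host-directories \
        -erobots=off \
        --directory-prefix="$dest" \
        "$url"
}

# Run only when the script is given at least one argument.
if [ "$#" -gt 0 ]; then
    mirror_page "$@"
fi
```

Saved as, say, mirror.sh, it could be called as `sh mirror.sh http://example.com/mypage.html ./snapshots`. Because --no-directories flattens everything into one folder, giving each mirrored page its own destination directory keeps files from different sites apart.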

NOTE: You need at least wget 1.12 to properly save images that are referenced through CSS files.
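
A script that depends on this can check the installed version before it starts. The sketch below assumes that the first line of `wget --version` has the form "GNU Wget 1.21.4 built on ...", with the version as the third word. The function names are hypothetical, and only the major and minor numbers are compared.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical.

# Succeeds (exit status 0) when version $1 is at least version $2,
# comparing only the major and minor numbers.
version_at_least() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%.*}
    want_major=${2%%.*}
    want_rest=${2#*.}
    want_minor=${want_rest%%.*}

    [ "$have_major" -gt "$want_major" ] && return 0
    [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]
}

# Print the version reported by wget. Assumes the first line of
# `wget --version` looks like "GNU Wget 1.21.4 built on linux-gnu."
installed_wget_version() {
    wget --version | head -n 1 | awk '{print $3}'
}

# Warn and fail when the local wget is too old to follow
# images referenced from CSS files.
require_wget_1_12() {
    if ! version_at_least "$(installed_wget_version)" 1.12; then
        echo "wget 1.12 or newer is needed for CSS-referenced images" >&2
        return 1
    fi
}
```

Calling `require_wget_1_12 || exit 1` at the top of a mirroring script stops it early, so it does not quietly produce pages with missing background images.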
