Save full webpage
Question
I've bumped into a problem while working on a project. I want to "crawl" certain websites of interest and save them as "full web pages", including styles and images, in order to build mirrors of them. It has happened to me several times that I bookmarked a website to read later, and a few days afterwards the site was down because it had been hacked and the owner had no backup of the database.
Of course, I can read the files very easily in PHP with fopen("http://website.com", "r") or fsockopen(), but the main goal is to save the full web pages, so that if a site goes down it is still available to others, like a "programming time machine" :)
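To illustrate the fetch step mentioned above, here is a minimal PHP sketch, assuming `allow_url_fopen` is enabled in php.ini; the function name and the stream-context options are illustrative choices, and this retrieves only the raw HTML, not the images or stylesheets it references:

```php
<?php
// Hypothetical sketch: fetch a page's raw HTML via PHP's stream wrappers.
// Assumes allow_url_fopen is On; the timeout and user agent are arbitrary.
function fetch_page(string $url): ?string {
    $context = stream_context_create([
        'http' => ['timeout' => 10, 'user_agent' => 'MirrorBot/0.1'],
    ]);
    // file_get_contents() returns false on failure; map that to null.
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}
```

This is exactly the limitation the question raises: the HTML alone is not a usable mirror, because the styles and images still live on the original server.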
Is there a way to do this without reading and saving each and every link on the page?
Objective-C solutions are also welcome, since I'm trying to learn more of that language as well.
Thanks!
Answer
You would actually need to parse the HTML and all referenced CSS files, which is NOT easy. However, a fast way to do it is to use an external tool like wget. After installing wget, you could run from the command line:

wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://example.com/mypage.html
This will download mypage.html along with all linked CSS files and images, including images referenced from inside the CSS. Once wget is installed on your system, you can use PHP's system() function to control wget programmatically.
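A minimal sketch of driving wget from PHP as suggested above, assuming wget is installed and on the PATH; the function names, the -P output directory, and the URL are illustrative, and escapeshellarg() guards the user-supplied values against shell injection:

```php
<?php
// Hypothetical sketch: invoke wget from PHP via system().
// Assumes wget (>= 1.12) is on the PATH; building the command string in
// its own function keeps it easy to inspect and test.
function build_wget_command(string $url, string $dir): string {
    return 'wget --no-parent --timestamping --convert-links --page-requisites'
         . ' --no-directories --no-host-directories -erobots=off'
         . ' -P ' . escapeshellarg($dir)   // -P: directory to save files into
         . ' '    . escapeshellarg($url);
}

function mirror_page(string $url, string $dir): bool {
    system(build_wget_command($url, $dir), $exitCode);
    return $exitCode === 0; // wget exits with 0 on success
}
```

For example, mirror_page('http://example.com/mypage.html', 'mirrors') would save the page and its requisites into a mirrors/ directory, returning true if wget succeeded.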
NOTE: You need at least wget 1.12 to properly save images that are referenced through CSS files.