How to 'Grab' content from another website


Problem description

A friend has asked me this, and I couldn't answer.

He asked: I am making this site where you can archive your site...

It works like this: you enter a site, say something.com, and our site grabs the content on that website (images and so on) and uploads it to our site. Then people can view an exact copy of the site at oursite.com/something.com, even if the server hosting something.com is down.

How could he do this (PHP?), and what would the requirements be?

Recommended answer

It sounds like you need to create a web crawler. Web crawlers can be written in just about any language, although I would recommend C++ (with cURL), Java (with URLConnection), or Python (with urllib2). You could probably also hack something together quickly with the curl or wget commands and bash, although that is probably not the best long-term solution. Also, don't forget that whenever you crawl someone's website you should download, parse, and respect its "robots.txt" file if one is present.
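To make the Python route a little more concrete, here is a minimal sketch using only the Python 3 standard library (urllib.request and urllib.robotparser; urllib2 was the Python 2 name). The site URL, output folder, and user-agent string are made-up placeholders, and a real archiver would also need to follow links, rewrite them to point at the local copies, and throttle its requests:

# Minimal archiving sketch, not production code: fetch one page,
# honor robots.txt, and save the page plus its images locally.
import os
import re
import urllib.parse
import urllib.request
import urllib.robotparser

START_URL = "http://something.com/"   # hypothetical site to archive
OUT_DIR = "archive"                   # local folder holding the copy
USER_AGENT = "MyArchiverBot/0.1"      # identify your crawler politely

def allowed(url):
    # Check robots.txt before fetching anything.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urllib.parse.urljoin(url, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return True  # no robots.txt reachable; proceed cautiously
    return rp.can_fetch(USER_AGENT, url)

def fetch(url):
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

def main():
    if not allowed(START_URL):
        print("robots.txt disallows crawling", START_URL)
        return
    os.makedirs(OUT_DIR, exist_ok=True)

    # Save the page itself.
    html = fetch(START_URL).decode("utf-8", errors="replace")
    with open(os.path.join(OUT_DIR, "index.html"), "w", encoding="utf-8") as f:
        f.write(html)

    # Very naive image extraction; a real crawler would use an HTML parser
    # and rewrite the src attributes to point at the local copies.
    for src in re.findall(r'<img[^>]+src=["\']([^"\']+)["\']', html, re.I):
        img_url = urllib.parse.urljoin(START_URL, src)
        if not allowed(img_url):
            continue
        name = os.path.basename(urllib.parse.urlparse(img_url).path) or "image"
        try:
            data = fetch(img_url)
        except OSError:
            continue
        with open(os.path.join(OUT_DIR, name), "wb") as f:
            f.write(data)

if __name__ == "__main__":
    main()

For the quick-hack route mentioned above, wget's recursive options (for example, wget --mirror --convert-links --page-requisites something.com) already do much of this out of the box, robots.txt handling included.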
