How to 'Grab' content from another website

Question

A friend has asked me this, and I couldn't answer.

He asked: I am making this site where you can archive your site...

It works like this: you enter your site, e.g. something.com, and then our site grabs the content on that website (images and so on) and uploads it to our site. People can then view an exact copy of the site at oursite.com/something.com, even if the server hosting something.com is down.

How could he do this? (PHP?) And what would the requirements be?

Answer

It sounds like you need to create a web crawler. Web crawlers can be written in any language, although I would recommend C++ (with cURL), Java (with URLConnection), or Python (with urllib2) for that. You could probably also hack something together quickly with the curl or wget commands and Bash, although that is probably not the best long-term solution. Also, don't forget that whenever you crawl someone's website you should download, parse, and respect the "robots.txt" file if it is present.
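
As a rough illustration of that advice, here is a minimal single-page sketch in Python 3 (urllib2 named in the answer is the Python 2 module; urllib.request is its Python 3 counterpart). It checks robots.txt, saves the page's HTML, and downloads the images the page references. The START_URL and OUT_DIR values are placeholder assumptions, and a real archiver would also need to rewrite links, fetch CSS/JS, and follow links to other pages.

# Minimal single-page "archiver" sketch in Python 3.
# It honours robots.txt, saves the HTML, and downloads referenced images.
# START_URL and OUT_DIR below are illustrative assumptions, not values
# taken from the original question.
import os
from html.parser import HTMLParser
from urllib import robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen, urlretrieve

START_URL = "http://something.com/"   # hypothetical site to archive
OUT_DIR = "archive/something.com"     # hypothetical local mirror directory


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag on the page."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.images.append(value)


def allowed_by_robots(url):
    """Download and honour robots.txt, as the answer recommends."""
    rp = robotparser.RobotFileParser(urljoin(url, "/robots.txt"))
    try:
        rp.read()
    except OSError:
        return True  # no robots.txt reachable; assume crawling is allowed
    return rp.can_fetch("*", url)


def archive_page(url, out_dir):
    if not allowed_by_robots(url):
        raise RuntimeError(f"robots.txt disallows fetching {url}")

    os.makedirs(out_dir, exist_ok=True)
    html = urlopen(url).read().decode("utf-8", errors="replace")

    # Save the HTML itself.
    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as f:
        f.write(html)

    # Parse out image URLs and download each one next to the HTML.
    parser = ImageCollector()
    parser.feed(html)
    for src in parser.images:
        img_url = urljoin(url, src)
        filename = os.path.basename(urlparse(img_url).path) or "image"
        urlretrieve(img_url, os.path.join(out_dir, filename))


if __name__ == "__main__":
    archive_page(START_URL, OUT_DIR)

To serve the archived copy at oursite.com/something.com, the saved directory would then simply be exposed by the web server, which is independent of the crawling step itself.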
