httrack wget curl scrape & fetch

Problem Description

There are a number of tools on the internet for downloading a static copy of a website, such as HTTrack. There are also many tools, some commercial, for "scraping" content from a website, such as Mozenda. Then there are tools which are apparently built in to programs like PHP and *nix where you can "file_get_contents" or "wget" or "cURL" or just "file()".

I am thoroughly confused by all of this, and I think the main reason is that none of the descriptions I have come across use the same vocabulary. On the surface, at least, it seems like they are all doing the same thing, but maybe not.

That is my question. What are these tools doing, exactly? Are they doing the same thing? Are they doing the same thing via different technology? If they aren’t doing the same thing, how are they different?

Solution

First, let me clarify the difference between "mirroring" and "scraping".

Mirroring refers to downloading the entire contents of a website, or some prominent section(s) of it (including HTML, images, scripts, CSS stylesheets, etc). This is often done to preserve and expand access to a valuable (and often limited) internet resource, or to add additional fail-over redundancy. For example, many universities and IT companies mirror various Linux vendors' release archives. Mirroring may imply that you plan on hosting a copy of the website on your own server (with the original content owner's permission).

Scraping refers to copying and extracting some interesting data from a website. Unlike mirroring, scraping targets a particular dataset (names, phone numbers, stock quotes, etc) rather than the entire contents of the site. For example, you could "scrape" average income data from the US Census Bureau or stock quotes from Google Finance. This is sometimes done against the host's terms and conditions, which can put you in breach of those terms and, in some jurisdictions, on the wrong side of the law.

The two can be combined in order to separate data copying (mirroring) from information extraction (scraping) concerns. For example, if the extraction and analysis of the data is slow or process-intensive, you may find that it's quicker to mirror the site first and then scrape your local copy.

To answer the rest of your question...

The PHP functions file_get_contents and file are for reading a file from a local or remote machine. The file may be an HTML file, or it could be something else, like a text file or a spreadsheet. This is not what either "mirroring" or "scraping" usually refers to, although you could write your own PHP-based mirror/scraper on top of these functions.
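For illustration only, here is a minimal sketch of what that looks like (the URL is just a placeholder, and a real script would also want proper error handling and allow_url_fopen enabled):

    <?php
    // Read a remote page into a single string.
    $html = file_get_contents('https://example.com/');
    if ($html === false) {
        die("Download failed\n");
    }

    // file() reads the same resource as an array of lines instead.
    $lines = file('https://example.com/', FILE_IGNORE_NEW_LINES);

    echo strlen($html) . " bytes, " . count($lines) . " lines\n";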

wget and curl are command-line stand-alone programs for downloading one or more files from remote servers, using a variety of options, conditions and protocols. Both are incredibly powerful and popular tools, the main difference being that wget has rich built-in features for mirroring entire websites.
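As a rough illustration of that difference (example.com is just a placeholder, and the exact flags you want will depend on the site):

    # curl: fetch one resource and save it under its remote file name
    curl -O https://example.com/index.html

    # wget: mirror a whole site, fetching page requisites and rewriting
    # links so the local copy can be browsed offline
    wget --mirror --convert-links --page-requisites --no-parent https://example.com/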

HTTrack is similar to wget in its intent, but uses a GUI instead of a command-line interface. This makes it easier to use for those not comfortable running commands from a terminal, at the cost of losing the power and flexibility provided by wget.

You can use HTTrack and wget for mirroring, but you will have to run your own programs on the resulting downloaded data to extract (scrape) information, if that's your ultimate goal.
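As a hypothetical sketch of that last step (assuming the wget mirror above and pages that happen to contain US-style phone numbers; the pattern and paths are illustrative only):

    # Scrape phone numbers out of every file in the local mirror
    grep -rhoE '[0-9]{3}-[0-9]{3}-[0-9]{4}' example.com/ | sort -u > phone-numbers.txt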

Mozenda is a scraper which, unlike HTTrack, wget, or curl, allows you to target specific data to be extracted rather than blindly copying all contents. I have little experience with it, however.

P.S. I usually use wget to mirror the HTML pages I'm interested in, and then run a combination of Ruby and R scripts to extract and analyze data.
