Crawler vs Scraper


Problem Description

Can somebody distinguish between a crawler and a scraper in terms of scope and functionality?

Recommended Answer

A crawler gets web pages -- i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore), it downloads whatever is linked to from the starting point(s).
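
A rough sketch of that idea in Python (standard library only; the start URL, depth limit, and list of ignored file extensions are made-up placeholders for illustration):

```python
# Minimal breadth-first crawler sketch: download pages reachable from a
# starting URL, up to a fixed link depth, skipping certain file types.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

IGNORED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip")  # example "types of files to ignore"

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_depth=2):
    """Download pages linked from start_url, up to max_depth links deep."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}  # url -> raw HTML

    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable page: skip it
        pages[url] = html

        if depth >= max_depth:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link, _ = urldefrag(urljoin(url, href))  # resolve relative links, drop #fragments
            if link.startswith("http") and not link.lower().endswith(IGNORED_EXTENSIONS) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages

if __name__ == "__main__":
    downloaded = crawl("http://example.com/", max_depth=1)  # placeholder start address
    print(f"Downloaded {len(downloaded)} pages")
```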

A scraper takes pages that have been downloaded or, in a more general sense, data that's formatted for display, and (attempts to) extract data from those pages, so that it can (for example) be stored in a database and manipulated as desired.
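
A correspondingly small scraping sketch (again standard-library Python; the HTML snippet, the `price` class, and the table schema are invented for illustration). It takes markup that has already been downloaded, extracts values from it, and stores them in a database:

```python
# Minimal scraper sketch: pull structured data out of already-downloaded
# HTML and store it in a database for later manipulation.
import sqlite3
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Extracts the <h1> title and any text inside <span class="price">."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.in_price = False
        self.title = ""
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_title = True
        elif tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False
        elif tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data.strip()
        elif self.in_price:
            self.prices.append(data.strip())

# A stand-in for a page the crawler already downloaded.
html_page = """<html><body>
  <h1>Example Widget</h1>
  <span class="price">19.99</span>
  <span class="price">24.99</span>
</body></html>"""

parser = ProductParser()
parser.feed(html_page)

# Persist the extracted data so it can be queried and manipulated as desired.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (product TEXT, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)",
                 [(parser.title, float(p)) for p in parser.prices])
print(conn.execute("SELECT product, MIN(price) FROM prices").fetchone())
```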

Depending on how you use the result, scraping may well violate the rights of the owner of the information and/or user agreements about use of web sites (crawling violates the latter in some cases as well). Many sites include a file named robots.txt in their root (i.e. having the URL http://server/robots.txt) to specify how (and if) crawlers should treat that site -- in particular, it can list (partial) URLs that a crawler should not attempt to visit. These can be specified separately per crawler (user-agent) if desired.
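
A minimal sketch of checking robots.txt before fetching a URL, using Python's standard urllib.robotparser (the site, paths, and user-agent string are placeholders):

```python
# Check a site's robots.txt before crawling; can_fetch() reports whether
# the given user-agent is allowed to visit a particular URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://example.com/robots.txt")  # i.e. http://server/robots.txt
rp.read()  # download and parse the file

user_agent = "MyCrawler/1.0"  # placeholder user-agent name
for url in ("http://example.com/public/page.html",
            "http://example.com/private/admin.html"):
    if rp.can_fetch(user_agent, url):
        print("allowed:   ", url)  # safe to download
    else:
        print("disallowed:", url)  # the site asked crawlers to stay away
```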
