What sequence of steps does crawler4j follow to fetch data?
Question
I'd like to learn:
- How does crawler4j work?
- Does it fetch a web page, then download its content and extract it?
- What about the .db and .csv files and their structure?
Generally, what sequence of steps does it follow?
Please, I want a descriptive answer.
Thanks
General Crawler Process
The process for a typical multi-threaded crawler is as follows:
We have a queue data structure called the `frontier`. Newly discovered URLs (or starting points, so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether a given URL was previously visited. Crawler threads then obtain URLs from the `frontier` and schedule them for later processing. The actual processing starts:
- The `robots.txt` for the given URL is determined and parsed to honour exclusion criteria and be a polite web crawler (configurable).
- Next, the thread checks for politeness, i.e. the time to wait before visiting the same host of a URL again.
- The actual URL is visited by the crawler and the content is downloaded (this can be literally anything).
- If we have HTML content, this content is parsed, and potential new URLs are extracted and added to the frontier (in `crawler4j` this can be controlled via `shouldVisit(...)`).
The whole process is repeated until no new URLs are added to the `frontier`.
General (Focused) Crawler Architecture
Besides the implementation details of `crawler4j`, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:
Disclaimer: Image is my own work. Please respect this by referencing this post.
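The multi-threaded part of this architecture can also be sketched: several worker threads share one frontier queue and one "already seen" set. Again, this is only an illustration of the design under assumed names (`WEB`, `crawl`, `pending`), not how crawler4j is implemented internally.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Multi-threaded sketch: crawler threads obtain URLs from a shared frontier,
// as in the architecture above. Not crawler4j internals.
class ParallelFrontierSketch {

    // Hypothetical in-memory link graph standing in for the real web.
    static final Map<String, List<String>> WEB = Map.of(
        "s",  List.of("p1", "p2"),
        "p1", List.of("p2", "p3"),
        "p2", List.of("p3"),
        "p3", List.of("s"));

    static Set<String> crawl(String seed, int threads) throws InterruptedException {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        Set<String> seen = ConcurrentHashMap.newKeySet(); // unique-URL check
        // URLs queued but not fully processed; 0 means the crawl is finished.
        AtomicInteger pending = new AtomicInteger(1);

        seen.add(seed);
        frontier.put(seed);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    while (pending.get() > 0) {
                        String url = frontier.poll(50, TimeUnit.MILLISECONDS);
                        if (url == null) continue; // queue briefly empty, retry
                        // Fetch + parse (simulated), enqueue unseen links.
                        for (String link : WEB.getOrDefault(url, List.of())) {
                            if (seen.add(link)) {
                                pending.incrementAndGet();
                                frontier.put(link);
                            }
                        }
                        pending.decrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(crawl("s", 3)); // all four pages, in no fixed order
    }
}
```

The `pending` counter is one simple way to let workers detect that the crawl is done: it reaches zero only when every enqueued URL has been fully processed and no thread can still add new ones.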