What sequence of steps does crawler4j follow to fetch data?


Problem Description


I'd like to learn:

  1. How does crawler4j work?
  2. Does it fetch a web page, then download its content and extract it?
  3. What about the .db and .csv files and their structures?

Generally, what sequence does it follow?

Please, I want a descriptive explanation.

Thanks

Solution

General Crawler Process

The process for a typical multi-threaded crawler is as follows:

  1. We have a queue data structure, which is called the frontier. Newly discovered URLs (or starting points, so-called seeds) are added to this data structure. In addition, every URL is assigned a unique ID in order to determine whether a given URL has been visited before.

  2. Crawler threads then obtain URLs from the frontier and schedule them for later processing.

  3. The actual processing starts:

    • The robots.txt for the given URL is determined and parsed to honour the exclusion criteria and be a polite web crawler (configurable).
    • Next, the thread checks for politeness, i.e. the time to wait before visiting the same host of a URL again.
    • The actual URL is visited by the crawler and the content is downloaded (this can be literally anything).
    • If we have HTML content, this content is parsed and potential new URLs are extracted and added to the frontier (in crawler4j this can be controlled via shouldVisit(...); a minimal sketch follows after this list).
  4. The whole process is repeated until no new URLs are added to the frontier.
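
For illustration, here is a minimal sketch of how steps 3 and 4 typically look from the crawler4j side: a WebCrawler subclass whose shouldVisit(...) filters which extracted URLs are added to the frontier, and whose visit(...) receives the downloaded, parsed content. The domain www.example.com and the file-extension filter are placeholders, and the method signatures follow the commonly used crawler4j 4.x API, so they may differ slightly in other versions.

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    import java.util.regex.Pattern;

    public class MyCrawler extends WebCrawler {

        // Skip typical binary resources so only HTML pages end up in the frontier.
        private static final Pattern BINARY =
                Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip|pdf)$");

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Step 3, last bullet: decide whether a newly extracted URL is added to the frontier.
            String href = url.getURL().toLowerCase();
            return !BINARY.matcher(href).matches()
                    && href.startsWith("https://www.example.com/");
        }

        @Override
        public void visit(Page page) {
            // Step 3: called after the URL has been fetched and its content downloaded and parsed.
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlData = (HtmlParseData) page.getParseData();
                System.out.println("Visited: " + page.getWebURL().getURL()
                        + " (text length: " + htmlData.getText().length()
                        + ", outgoing links: " + htmlData.getOutgoingUrls().size() + ")");
            }
        }
    }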

General (Focused) Crawler Architecture

Besides the implementation details of crawler4j, a more or less general (focused) crawler architecture (on a single server/PC) looks like this:

Disclaimer: Image is my own work. Please respect this by referencing this post.
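
To relate this architecture back to crawler4j, the sketch below shows a typical CrawlController setup: the crawl storage folder holds crawler4j's internal frontier/doc-ID databases (likely what the question's .db files refer to), the politeness delay and robots.txt server implement the politeness handling from step 3, and the thread count determines how many crawler threads pull URLs from the frontier. The folder path, seed URL, delay and thread count are placeholder values, and the API again reflects the commonly used crawler4j 4.x classes.

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class CrawlerMain {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            // Folder where crawler4j persists its internal frontier/doc-ID databases (placeholder path).
            config.setCrawlStorageFolder("/tmp/crawl-storage");
            // Politeness: minimum delay in ms between requests to the same host (step 3, second bullet).
            config.setPolitenessDelay(1000);
            config.setMaxDepthOfCrawling(3);

            PageFetcher pageFetcher = new PageFetcher(config);
            // robots.txt fetching and parsing (step 3, first bullet).
            RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("https://www.example.com/");  // seed URL added to the frontier (placeholder)

            // Start the crawl with a pool of crawler threads; blocks until the frontier runs empty.
            controller.start(MyCrawler.class, 4);
        }
    }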
