Crawler4j vs. Jsoup for the pages crawling and parsing in Java


Question

I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup.

Both of them are capable of retrieving the content of a page and extracting sub-parts of it. The only thing I don't understand is the difference between them. There is a similar question, which is marked as answered:


Crawler4j is a crawler, Jsoup is a parser.

But I just checked: Jsoup 1.8.3 is also capable of fetching a page in addition to its parsing functionality, while Crawler4j is capable of not only crawling a page but also parsing its content.

Thus, can you please clarify the difference between Crawler4j and Jsoup?

Answer

Crawling is something bigger than just retrieving the contents of a single URI. If you just want to retrieve the content of some pages, then there is no real benefit from using something like Crawler4j.

Let's take a look at an example. Assume you want to crawl a website. The requirements would be:


  1. Give a base URI (the home page).
  2. Take all the URIs from each page and retrieve the contents of those too.
  3. Move recursively through every URI you retrieve.
  4. Retrieve the contents only of URIs that are inside this website (there could be external URIs referencing another website; we don't need those).
  5. Avoid circular crawling. Page A has a URI for page B (of the same site). Page B has a URI for page A, but we have already retrieved the content of page A (the About page has a link to the Home page, but we already got the contents of the Home page, so don't visit it again).
  6. The crawling operation must be multithreaded.
  7. The website is vast. It contains a lot of pages. We only want to retrieve 50 URIs, beginning from the Home page.

This is a simple scenario. Try solving it with Jsoup: all of this functionality must be implemented by you. Crawler4j, or any crawler micro-framework for that matter, would or should have an implementation for the actions above. Jsoup's strong qualities shine when you get to decide what to do with the content.
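To make that concrete, here is a minimal sketch of how those requirements map onto crawler4j (written against the 4.x API; the domain, storage folder, and thread count are placeholder assumptions, not values from the question):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class SiteCrawler extends WebCrawler {

    // Requirement 4: only follow URIs inside the target site (placeholder domain)
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().toLowerCase().startsWith("https://www.example.com/");
    }

    // Requirements 2-3: crawler4j calls visit() for every fetched page and queues
    // the outgoing links itself; it also tracks already-visited URLs internally,
    // which covers requirement 5 (no circular crawling).
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();
            // hand "html" to your own extraction code here
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-data"); // intermediate crawl state
        config.setMaxPagesToFetch(50);                   // requirement 7

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://www.example.com/"); // requirement 1: base URI
        controller.start(SiteCrawler.class, 4);         // requirement 6: 4 threads
    }
}
```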

Now let's take a look at some requirements for parsing.


  1. Get all paragraphs of a page.
  2. Get all images.
  3. Remove invalid tags (tags that do not comply with the HTML spec).
  4. Remove script tags.

This is where Jsoup comes into play. Of course, there is some overlap here: some things might be possible with both Crawler4j and Jsoup, but that doesn't make them equivalent. You could remove the content-retrieval mechanism from Jsoup and it would still be an amazing tool to use. If Crawler4j dropped retrieval, it would lose half of its functionality.
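As a rough sketch of those four parsing requirements with Jsoup (the URL is a placeholder; Whitelist is the sanitizer class in the 1.8.x line the question mentions):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;

public class PageParser {
    public static void main(String[] args) throws Exception {
        // Jsoup can fetch a single page by itself (placeholder URL)
        Document doc = Jsoup.connect("https://www.example.com/").get();

        // Requirement 1: all paragraphs
        for (Element p : doc.select("p")) {
            System.out.println(p.text());
        }

        // Requirement 2: all images ("abs:src" resolves relative URLs)
        for (Element img : doc.select("img[src]")) {
            System.out.println(img.attr("abs:src"));
        }

        // Requirement 4: remove script tags in place
        doc.select("script").remove();

        // Requirement 3: keep only whitelisted tags, dropping everything else
        String cleanHtml = Jsoup.clean(doc.body().html(), Whitelist.basic());
        System.out.println(cleanHtml);
    }
}
```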

I used both of them in the same project in a real-life scenario. I crawled a site, leveraging the strong points of Crawler4j, for all the problems mentioned in the first example. Then I passed the content of each page I retrieved to Jsoup in order to extract the information I needed. Could I have done without one or the other? Yes, I could, but I would have had to implement all the missing functionality.
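A condensed sketch of that division of labour, assuming the same crawler4j visit() hook as above and a hypothetical h1 heading as the extraction target:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class CombinedCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();

            // crawler4j did the crawling; Jsoup now does the fine-grained parsing
            Document doc = Jsoup.parse(html, page.getWebURL().getURL());
            String title = doc.select("h1").text(); // hypothetical extraction target
            System.out.println(page.getWebURL().getURL() + " -> " + title);
        }
    }
}
```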

Hence the difference: Crawler4j is a crawler with some simple operations for parsing (you can extract the images in one line), but there is no implementation for complex CSS queries. Jsoup is a parser that gives you a simple API for HTTP requests. For anything more complex, there is no implementation.

