Crawler4j vs. Jsoup for crawling and parsing pages in Java
Question
I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup.

Both of them are capable of retrieving the content of a page and extracting sub-parts of it. The only thing I don't understand is the difference between them. There is a similar question, which is marked as answered:

Crawler4j is a crawler; Jsoup is a parser.

But I just checked: Jsoup 1.8.3 is also capable of crawling a page in addition to its parsing functionality, while Crawler4j is capable not only of crawling pages but also of parsing their content.
Thus, can you please clarify the difference between Crawler4j and Jsoup?
Answer
Crawling is something bigger than just retrieving the contents of a single URI. If you just want to retrieve the content of some pages, then there is no real benefit in using something like Crawler4J.
Let's take a look at an example. Assume you want to crawl a website. The requirements would be:
- Give a base URI (home page).
- Take all the URIs from each page and retrieve their contents too.
- Move recursively for every URI you retrieve.
- Retrieve the contents only of URIs that are inside this website (there could be external URIs referencing another website; we don't need those).
- Avoid circular crawling. Page A has a URI for page B (of the same site). Page B has a URI for page A, but we already retrieved the content of page A (the About page has a link to the Home page, but we already got the contents of the Home page, so don't visit it again).
- The crawling operation must be multithreaded.
- The website is vast. It contains a lot of pages. We only want to retrieve 50 URIs, beginning from the Home page.
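The "internal URIs only", "avoid cycles", and "50-URI cap" requirements boil down to bookkeeping that a crawler framework carries for you. A minimal stdlib-only sketch of that filtering logic (the `CrawlFrontier` class and its method names are illustrative, not from either library):

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the crawl bookkeeping described above:
// same-site filtering, cycle avoidance, and a page limit.
public class CrawlFrontier {
    private final String baseHost;               // only URIs on this host are crawled
    private final Set<String> visited = new HashSet<>();
    private final int maxPages;                  // e.g. 50 in the example above

    public CrawlFrontier(String baseUri, int maxPages) {
        this.baseHost = URI.create(baseUri).getHost();
        this.maxPages = maxPages;
    }

    /** Returns true (and records the URI) only if it should be fetched. */
    public synchronized boolean shouldVisit(String uri) {
        if (visited.size() >= maxPages) return false;             // page cap reached
        String host = URI.create(uri).getHost();
        if (host == null || !host.equals(baseHost)) return false; // external site
        return visited.add(uri);                                  // false if seen before: no cycles
    }

    public static void main(String[] args) {
        CrawlFrontier f = new CrawlFrontier("https://example.com/", 50);
        System.out.println(f.shouldVisit("https://example.com/about")); // true
        System.out.println(f.shouldVisit("https://example.com/about")); // false: already visited
        System.out.println(f.shouldVisit("https://other.org/page"));    // false: external host
    }
}
```

Recursion, the thread pool, and the actual fetching are still on you; that is exactly the work Crawler4J packages up (the `synchronized` keyword is only a nod to the multithreading requirement).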
This is a simple scenario. Try solving it with Jsoup: all of this functionality must be implemented by you. Crawler4J, or any crawler micro-framework for that matter, would or should have an implementation for the actions above. Jsoup's strong qualities shine when you get to decide what to do with the content.
Let's take a look at some requirements for parsing:
- Get all paragraphs of a page.
- Get all images.
- Remove invalid tags (tags that do not comply with the HTML spec).
- Remove script tags.
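Each of these parsing requirements is roughly a one-liner with Jsoup's CSS-selector API. A sketch, assuming the jsoup jar is on the classpath:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ParseExamples {
    public static void main(String[] args) {
        String html = "<html><body><p>First</p><p>Second</p>"
                    + "<img src='a.png'><script>alert('x')</script></body></html>";
        Document doc = Jsoup.parse(html);

        Elements paragraphs = doc.select("p");   // all paragraphs of the page
        Elements images = doc.select("img");     // all images
        doc.select("script").remove();           // strip script tags

        System.out.println(paragraphs.size());           // 2
        System.out.println(images.attr("src"));          // a.png
        System.out.println(doc.select("script").size()); // 0 after removal
    }
}
```

For dropping non-compliant markup, Jsoup also ships a sanitizer (`Jsoup.clean(html, ...)` with a whitelist of allowed tags; the whitelist class was named `Whitelist` in the 1.8.x line and renamed `Safelist` in later releases). Fetching a page is equally short: `Jsoup.connect(url).get()` returns a parsed `Document`.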
This is where Jsoup comes into play. Of course, there is some overlap here. Some things might be possible with both Crawler4J and Jsoup, but that doesn't make them equivalent. You could remove the mechanism of retrieving content from Jsoup and it would still be an amazing tool to use. If Crawler4J removed the retrieval, it would lose half of its functionality.
I used both of them in the same project in a real-life scenario. I crawled a site, leveraging the strong points of Crawler4J, for all the problems mentioned in the first example. Then I passed the content of each page I retrieved to Jsoup, in order to extract the information I needed. Could I have used only one or the other? Yes, I could, but I would have had to implement all the missing functionality.
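That division of labor can be sketched in a few lines. This is a sketch, not the answerer's actual code: it assumes the crawler4j 4.x API plus jsoup on the classpath, and the seed URL, storage folder, and thread count are all placeholder values:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;
import org.jsoup.Jsoup;

public class CrawlAndParse {
    // Crawler4J drives the crawl; Jsoup parses each fetched page.
    public static class MyCrawler extends WebCrawler {
        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // requirement: internal URIs only (placeholder site)
            return url.getURL().startsWith("https://example.com/");
        }

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                String html = ((HtmlParseData) page.getParseData()).getHtml();
                // hand the raw HTML over to Jsoup for the real extraction
                int paragraphs = Jsoup.parse(html).select("p").size();
                logger.info("{} -> {} paragraphs", page.getWebURL().getURL(), paragraphs);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // illustrative path
        config.setMaxPagesToFetch(50);                // requirement: stop after 50 URIs

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://example.com/");   // requirement: base URI (home page)
        controller.start(MyCrawler.class, 8);         // requirement: multithreaded (8 threads)
    }
}
```

Crawler4J handles the frontier, deduplication, politeness, and threading; everything inside `visit` is plain Jsoup, which is exactly the split described above.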
Hence the difference: Crawler4J is a crawler with some simple operations for parsing (you could extract the images in one line), but there is no implementation for complex CSS queries. Jsoup is a parser that gives you a simple API for HTTP requests. For anything more complex, there is no implementation.