Crawler in Groovy (JSoup vs Crawler4j)

Question

I wish to develop a web crawler in Groovy (using the Grails framework and a MongoDB database) that can crawl a website, building a list of the site's URLs along with each resource's type, its content, the response time, and the number of redirects involved.

I am debating between JSoup and Crawler4j. I have read about what each basically does, but I cannot clearly see the difference between the two. Can anyone suggest which would be better for the functionality above? Or is it simply incorrect to compare the two?

Thanks.

Solution

Crawler4j is a crawler; Jsoup is a parser. Actually, you could (and should) use both. Crawler4j provides an easy-to-use multithreaded interface for fetching all the URLs and all the pages (content) of the site you want. After that, you can use Jsoup to parse the data with its excellent (jQuery-like) CSS selectors and actually do something with it. Of course, you also have to consider dynamic (JavaScript-generated) content. If you want that content too, you have to use something that includes a JavaScript engine (a headless browser plus a parser), such as HtmlUnit or WebDriver (Selenium), which will execute the JavaScript before parsing the content.
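To illustrate that division of labour, here is a minimal sketch in Groovy (the language the question targets) wiring the two libraries together: Crawler4j drives the multithreaded crawl, and each fetched page's HTML is handed to Jsoup for CSS-selector parsing. The seed URL, storage folder, and crawler class name are placeholders, and the shouldVisit/visit signatures follow the Crawler4j 4.x API, which differs from older releases:

```groovy
import edu.uci.ics.crawler4j.crawler.CrawlConfig
import edu.uci.ics.crawler4j.crawler.CrawlController
import edu.uci.ics.crawler4j.crawler.Page
import edu.uci.ics.crawler4j.crawler.WebCrawler
import edu.uci.ics.crawler4j.fetcher.PageFetcher
import edu.uci.ics.crawler4j.parser.HtmlParseData
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer
import edu.uci.ics.crawler4j.url.WebURL
import org.jsoup.Jsoup

class SiteCrawler extends WebCrawler {

    @Override
    boolean shouldVisit(Page referringPage, WebURL url) {
        // Restrict the crawl to the seed domain (placeholder filter)
        url.getURL().startsWith('http://www.example.com/')
    }

    @Override
    void visit(Page page) {
        // Crawler4j has already fetched the page; record what the
        // question asks for (URL, resource type, status code)
        println "Visited ${page.getWebURL().getURL()} " +
                "[${page.getContentType()}, HTTP ${page.getStatusCode()}]"

        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml()
            // Hand the raw HTML to Jsoup; the base URI makes abs:href work
            def doc = Jsoup.parse(html, page.getWebURL().getURL())
            println "  title: ${doc.title()}"
            doc.select('a[href]').each { link ->
                println "  link:  ${link.attr('abs:href')}"
            }
        }
    }
}

// Standard Crawler4j bootstrap
def config = new CrawlConfig(crawlStorageFolder: '/tmp/crawl-data')
def fetcher = new PageFetcher(config)
def robots = new RobotstxtServer(new RobotstxtConfig(), fetcher)
def controller = new CrawlController(config, fetcher, robots)
controller.addSeed('http://www.example.com/')
controller.start(SiteCrawler, 3)   // 3 crawler threads
```

Persisting the URL, content type, status code, and timing data into MongoDB (e.g. via a GORM domain class in Grails) would slot naturally into visit(), since Crawler4j calls it once per successfully fetched page.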
