Calling Controller.Start in a loop in Crawler4j?


Problem description

I asked a question here earlier. This is a different question, though it may sound similar.

Using crawler4j, I want to crawl multiple seed URLs with a restriction on domain name (that is, a domain-name check in shouldVisit). Here is an example of how to do it. In short, you set the list of domain names using customData and pass it to the crawler class (from the controller); in the shouldVisit function you loop through this data (which is a list, see the linked URL) to check whether the domain name is in the list, and if so return true.
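A minimal sketch of that pattern is below, assuming the crawler4j 3.x shouldVisit(WebURL) signature and that the controller passed a List of allowed domains via setCustomData; the MyCrawler name and the cast are illustrative, not taken from the linked example:

import java.util.List;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();

        // The controller handed the allowed domains over via setCustomData(...)
        @SuppressWarnings("unchecked")
        List<String> allowedDomains = (List<String>) getMyController().getCustomData();

        // Linear scan over every allowed domain -- this is the loop that gets
        // expensive when the seed list grows into the thousands.
        for (String domain : allowedDomains) {
            if (href.contains(domain)) {
                return true;
            }
        }
        return false;
    }
}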

There is a glitch in this. If google.com and yahoo.com are both in the seed-URL domain list, and www.yahoo.com/xyz links to www.google.com/zyx, the crawler will visit that page, because www.google.com is in our domains-to-visit list. Also, the for loop in shouldVisit can get heavy when the number of seed URLs is huge (thousands), and it consumes some memory as well.

To counter this, I can think of looping over the seed URLs instead. It might look like this:

while (s.next()) {
    // one controller per seed URL (the names below are placeholders for the real values)
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
    controller.addSeed(someSeedUrl);
    // only this seed's domain is checked in shouldVisit
    controller.setCustomData(domainNameOfSeedUrl);
    controller.start(MyCrawler.class, numberOfCrawlers);
}

I am not sure whether this is a terrible idea, but is there any advantage or disadvantage to doing it this way in performance terms? Any other concerns?

I tested it, and this approach seems to consume too much time (probably in opening and closing a controller instance on every iteration). I am hoping there is some other solution.

Recommended answer

Try the solution I found in a related thread:

As of version 3.0, this feature is implemented in crawler4j. Please visit http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/multiple/ for an example usage.

Basically, you need to start the controller in non-blocking mode:

controller.startNonBlocking(MyCrawler.class, numberOfThreads);

Then you can add your seeds in a loop. Note that you do not need to start the controller multiple times in a loop.
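Following that advice, a minimal sketch might look like the code below; seedUrls and numberOfThreads are placeholder names, and waitUntilFinish() is only needed if you want to block until the crawl completes:

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

// add every seed up front -- no separate controller per seed is needed
for (String seed : seedUrls) {
    controller.addSeed(seed);
}

// start once, in non-blocking mode, so the calling thread is not blocked
controller.startNonBlocking(MyCrawler.class, numberOfThreads);

// ... do other work, or start further controllers here ...

// block until this crawl has finished
controller.waitUntilFinish();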

Hope this helps!

