Calling the controller (crawler4j-3.5) inside a loop

Question

Hi, I am calling the controller inside a for loop because I have more than 100 URLs. I keep them all in a list, iterate over it, and crawl each page. I also pass each URL to setCustomData, because the crawl should not leave that URL's domain.

for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) {
    String str = iterator.next();
    System.out.println("checking " + str);
    CrawlController controller = new CrawlController(config, pageFetcher,
        robotstxtServer);
    controller.setCustomData(str);
    controller.addSeed(str);
    controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);
    controller.waitUntilFinish();
}

But when I run the code above, the first URL is crawled perfectly; then the second URL starts and prints errors like the following.

50982 [main] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - Crawler 1 started.
51982 [Crawler 1] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager  - Connection request: [route: {}->http://www.connectzone.in][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 100]
60985 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - It looks like no thread is working, waiting for 10 seconds to make sure...
70986 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
80986 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - All of the crawlers are stopped. Finishing the process...
80987 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - Waiting for 10 seconds before final clean up...
91050 [Thread-2] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager  - Connection manager is shutting down
91051 [Thread-2] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager  - Connection manager shut down

Please help me solve this problem. I want to start and run the controller inside a loop, because I have a lot of URLs in the list.

Note: I am using crawler4j-3.5.jar and its dependencies.

Recommended answer

Try adding every URL as a seed on a single controller:

for(String url : urls) {
    controller.addSeed(url);
}

and override shouldVisit(WebURL) so that it can't leave the domains.
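The domain check that such a shouldVisit override needs can be sketched on its own, without crawler4j on the classpath. The class below is a hypothetical helper, not part of crawler4j: the name SeedDomainFilter, its methods, and the choice to treat "www." and subdomains as the same site are all assumptions made for illustration. It records the host of every seed URL and accepts only URLs whose host matches one of them.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a crawler4j class): keeps a crawl inside the seed hosts. */
final class SeedDomainFilter {
    private final Set<String> allowedHosts = new HashSet<>();

    SeedDomainFilter(List<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                allowedHosts.add(host);
            }
        }
    }

    /** Lower-cased host without a leading "www.", or null if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** True when the URL's host is a seed host or a subdomain of one (an assumed policy). */
    boolean isAllowed(String url) {
        String host = hostOf(url);
        if (host == null) {
            return false;
        }
        for (String allowed : allowedHosts) {
            if (host.equals(allowed) || host.endsWith("." + allowed)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The second seed is an illustrative placeholder, not a URL from the question.
        SeedDomainFilter filter = new SeedDomainFilter(
                Arrays.asList("http://www.connectzone.in", "http://example.com/start"));
        System.out.println(filter.isAllowed("http://www.connectzone.in/about"));
        System.out.println(filter.isAllowed("http://other-site.org/"));
    }
}
```

Inside the crawler class, shouldVisit(WebURL url) could then return filter.isAllowed(url.getURL()), with the filter built once from the same list that is passed to addSeed. How the filter reaches the crawler instance (a static field, or the controller's custom data) is left open here.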
