Calling Controller.Start in a loop in Crawler4j?


Problem description

I asked a question here earlier. This is a different question, though it may sound similar.

Using crawler4j, I want to crawl multiple seed URLs with a restriction on domain name (that is, a domain-name check in shouldVisit). Here is an example of how to do it. In short, you set the list of domain names using customData and pass it to the crawler class (from the controller); in the shouldVisit function you loop through this data (which is a list, see the linked URL) to check whether the domain name is in the list, and if so return true.
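A minimal sketch of that pattern is below, assuming the crawler4j 3.x shouldVisit(WebURL) signature and that the controller passed a List of allowed domains via setCustomData; the MyCrawler name and the cast are illustrative, not taken from the linked example:

import java.util.List;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();

        // The controller handed the allowed domains over via setCustomData(...)
        @SuppressWarnings("unchecked")
        List<String> allowedDomains = (List<String>) getMyController().getCustomData();

        // Linear scan over every allowed domain -- this is the loop that gets
        // expensive when the seed list grows into the thousands.
        for (String domain : allowedDomains) {
            if (href.contains(domain)) {
                return true;
            }
        }
        return false;
    }
}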

There is a glitch in this. If google.com and yahoo.com are both in the seed-URL domain list, and www.yahoo.com/xyz links to www.google.com/zyx, the crawler will visit that page, because www.google.com is in our domains-to-visit list. Also, the for loop in shouldVisit can get heavy when the number of seed URLs is huge (thousands), and it consumes some memory as well.

To counter this, I can think of looping over the seed URLs instead. It might look like this:

while (s.next()) {
    // one controller per seed URL (the names below are placeholders for the real values)
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
    controller.addSeed(someSeedUrl);
    // only this seed's domain is checked in shouldVisit
    controller.setCustomData(domainNameOfSeedUrl);
    controller.start(MyCrawler.class, numberOfCrawlers);
}

I am not sure whether this is a terrible idea, but is there any advantage or disadvantage to doing it this way in performance terms? Any other concerns?

I tested it, and this approach seems to consume too much time (probably in opening and closing a controller instance on every iteration). I am hoping there is some other solution.

Recommended answer

Try the solution I found in a related thread:

As of version 3.0, this feature is implemented in crawler4j. Please visit http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/multiple/ for an example usage.

Basically, you need to start the controller in non-blocking mode:

controller.startNonBlocking(MyCrawler.class, numberOfThreads);

Then you can add your seeds in a loop. Note that you do not need to start the controller multiple times in a loop.
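Following that advice, a minimal sketch might look like the code below; seedUrls and numberOfThreads are placeholder names, and waitUntilFinish() is only needed if you want to block until the crawl completes:

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

// add every seed up front -- no separate controller per seed is needed
for (String seed : seedUrls) {
    controller.addSeed(seed);
}

// start once, in non-blocking mode, so the calling thread is not blocked
controller.startNonBlocking(MyCrawler.class, numberOfThreads);

// ... do other work, or start further controllers here ...

// block until this crawl has finished
controller.waitUntilFinish();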

Hope this helps!

