Restricting URLs to seed URL domain only (crawler4j)


Question

I want crawler4j to visit pages only if they belong to one of the domains in the seed list. There are multiple domains in the seed. How can I do that?

Suppose I am adding these seed URLs:


  • www.google.com

  • www.yahoo.com

  • www.wikipedia.com

Now I am starting the crawl, but I want my crawler to visit pages (via shouldVisit()) only within the above three domains. Obviously there are external links, but I want my crawler to restrict itself to these domains only. Sub-domains and sub-folders are okay, but nothing outside these domains.

Answer

If you are trying to restrict the crawler to only URLs with the same domains as the seed URLs, then:


  1. Extract the domain names from the seed URLs.

  2. Write your crawler class (extending WebCrawler) with a shouldVisit method that filters out any URL whose domain is not in that set (see the sketch after this list).

  3. Configure the controller, add the seeds, and start it in the normal way ... as per the example here.
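
Below is a minimal sketch of all three steps in one file. It assumes crawler4j 4.x, where shouldVisit receives (Page referringPage, WebURL url) and WebURL.getDomain() returns the registered domain (e.g. "google.com" for http://www.google.com/); in older versions the signature is shouldVisit(WebURL url), so adjust accordingly. The class name, storage folder, and thread count are placeholders.

```java
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class DomainRestrictedCrawler extends WebCrawler {

    // Step 1: the domain names extracted from the seed URLs.
    private static final Set<String> ALLOWED_DOMAINS =
            Set.of("google.com", "yahoo.com", "wikipedia.com");

    // Step 2: only follow URLs whose registered domain is in the set.
    // Sub-domains (e.g. news.google.com) and sub-folders still pass,
    // because getDomain() strips the sub-domain part.
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return ALLOWED_DOMAINS.contains(url.getDomain().toLowerCase());
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    // Step 3: configure the controller, add the seeds, and start.
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder path

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.google.com/");
        controller.addSeed("http://www.yahoo.com/");
        controller.addSeed("http://www.wikipedia.com/");

        // Start 4 crawler threads and block until the crawl finishes.
        controller.start(DomainRestrictedCrawler.class, 4);
    }
}
```

Hard-coding the set works for a fixed seed list; if you build the seeds dynamically, derive the set from the same strings you pass to addSeed() (e.g. by parsing them with java.net.URI) so the filter and the seeds cannot drift apart.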
