Restricting URLs to seed URL domain only (crawler4j)


Question

I want crawler4j to visit pages only if they belong to one of the domains in the seed list. There are multiple domains in the seed. How can I do that?

Suppose I am adding these seed URLs:


  • www.google.com

  • www.yahoo.com

  • www.wikipedia.com

Now I am starting the crawl, but I want my crawler to visit pages (via shouldVisit()) only within the above three domains. Obviously there are external links, but I want my crawler to restrict itself to these domains only. Sub-domains and sub-folders are okay, but nothing outside these domains.

Answer

If you are trying to restrict the crawler to only URLs with the same domains as the seed URLs, then:


  1. Extract the domain names from the seed URLs.

  2. Write your crawler class (extending WebCrawler) with a shouldVisit method that filters out any URL whose domain is not in that set (see the sketch after this list).

  3. Configure the controller, add the seeds, and start it in the normal way ... as per the example here.
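
Below is a minimal sketch of all three steps in one file. It assumes crawler4j 4.x, where shouldVisit receives (Page referringPage, WebURL url) and WebURL.getDomain() returns the registered domain (e.g. "google.com" for http://www.google.com/); in older versions the signature is shouldVisit(WebURL url), so adjust accordingly. The class name, storage folder, and thread count are placeholders.

```java
import java.util.Set;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class DomainRestrictedCrawler extends WebCrawler {

    // Step 1: the domain names extracted from the seed URLs.
    private static final Set<String> ALLOWED_DOMAINS =
            Set.of("google.com", "yahoo.com", "wikipedia.com");

    // Step 2: only follow URLs whose registered domain is in the set.
    // Sub-domains (e.g. news.google.com) and sub-folders still pass,
    // because getDomain() strips the sub-domain part.
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return ALLOWED_DOMAINS.contains(url.getDomain().toLowerCase());
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    // Step 3: configure the controller, add the seeds, and start.
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder path

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller =
                new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.google.com/");
        controller.addSeed("http://www.yahoo.com/");
        controller.addSeed("http://www.wikipedia.com/");

        // Start 4 crawler threads and block until the crawl finishes.
        controller.start(DomainRestrictedCrawler.class, 4);
    }
}
```

Hard-coding the set works for a fixed seed list; if you build the seeds dynamically, derive the set from the same strings you pass to addSeed() (e.g. by parsing them with java.net.URI) so the filter and the seeds cannot drift apart.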
