Apache Nutch: No URLs to fetch - check your seed list and URL filters


Problem description


I'm using Nutch 1.2. When I run the crawl command like so:

bin/nutch crawl urls -dir crawl -depth 2 -topN 1000

Injector: starting at 2011-07-11 12:18:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-11 12:18:44, elapsed: 00:00:07
Generator: starting at 2011-07-11 12:18:45
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
**No URLs to fetch - check your seed list and URL filters.**
crawl finished: crawl

The problem is that it keeps complaining: No URLs to fetch - check your seed list and URL filters.

I have a list of URLs to crawl in the nutch_root/urls/nutch file. My crawl-urlfilter.txt is also set up.
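(For reference, the injector expects the seed file to hold one URL per line. Assuming the seeds are the two hosts whitelisted in the filter below, urls/nutch would look something like this:

http://152.111.1.87/
http://152.111.1.88/
)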

Why would it complain about my URL list and filters? It never did this before.

Here is my crawl-urlfilter.txt:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.


# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*152.111.1.87/
+^http://([a-z0-9]*\.)*152.111.1.88/

# skip everything else
-.
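Nutch's regex-urlfilter evaluates these rules top to bottom, and the first rule whose pattern matches decides the URL's fate: + accepts, - rejects, and anything that falls through to the final -. is dropped, which is what produces "0 records selected for fetching". Below is a minimal, self-contained Java sketch of that first-match-wins logic, not Nutch's actual RegexURLFilter, just an illustration with an abridged rule set using the same java.util.regex semantics. Note how the unescaped dots in the + rules also match hosts that merely resemble the IP:

import java.util.regex.Pattern;

public class UrlFilterDemo {

    // Abridged rules from the crawl-urlfilter.txt above.
    // Leading '+' means accept, leading '-' means reject.
    static final String[] RULES = {
        "-^(file|ftp|mailto):",
        "+^http://([a-z0-9]*\\.)*152.111.1.87/",
        "+^http://([a-z0-9]*\\.)*152.111.1.88/",
        "-."
    };

    // First matching rule wins; a URL matched by no rule is also dropped.
    static boolean accepts(String url) {
        for (String rule : RULES) {
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://152.111.1.87/"));  // true: the intended seed
        System.out.println(accepts("http://152x111x1x87/"));  // true: unescaped '.' over-matches
        System.out.println(accepts("http://example.com/"));   // false: caught by the final "-."
    }
}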

Solution

Your URL filter rules look weird and I don't think they match valid URLs. Something like this should work better, no?

+^http://152\.111\.1\.87/
+^http://152\.111\.1\.88/
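Escaping the dots makes each pattern match the literal IP address only, instead of letting . stand for any character in those positions. If your Nutch distribution ships the URLFilterChecker helper class (it is present in Nutch 1.x), you can also test the rules without launching a crawl; the exact invocation may vary by version, but it is along these lines:

bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/nutch

It reads URLs from stdin, runs each through the configured filters, and echoes it back prefixed with + (accepted) or - (rejected), so you can confirm your seeds survive the filter chain before re-running the crawl.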
