Apache Nutch not adding internal links in a web page to fetchlist

Problem description

I am using Apache Nutch 1.7 and have a problem crawling with http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. The page contains many internal links as well as many external links to other domains; I am only interested in the internal links.

However, when this page is crawled, its internal links are not added for fetching in the next round (I have set a depth of 100). I have already set db.ignore.internal.links to false, but for some reason the internal links are still not added to the next round's fetch list.

On the other hand, if I set db.ignore.external.links to false, it correctly picks up all the external links from the page.

This problem does not occur with any other domain. Can someone tell me what is different about this particular page?

I have also attached the nutch-site.xml that I am using for your review. Please advise.

Recommended answer

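Your seed URL is being ignored by the default URL filters, so your page is not being crawled. The URL contains ? and =, and the default rule that skips probable query URLs rejects any URL containing those characters.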

Edit the following files:

conf/automaton-urlfilter.txt

conf/regex-urlfilter.txt

Replace

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

with

# skip URLs containing certain characters as probable queries, etc.
-.*[*!@].*
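
To see why removing ? and = from the rule matters for this seed URL, here is a minimal Python sketch of sign-prefixed, first-match-wins filtering. It is not Nutch's actual code: the accepts helper is hypothetical, the trailing "+." catch-all rule is assumed to follow the skip rule in the filter file, and Python's re module only approximates the Java regex behaviour.

```python
import re

def accepts(rules, url):
    """Hypothetical helper: the first rule whose pattern matches decides.

    Each rule is a sign ('+' accept, '-' reject) followed by a regex.
    A URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

SEED = "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"

# Rule quoted in the answer, followed by an assumed catch-all accept rule.
default_rules = ["-.*[?*!@=].*", "+."]
# Same rules after dropping '?' and '=' from the character class.
edited_rules = ["-.*[*!@].*", "+."]

print(accepts(default_rules, SEED))  # False: the URL contains '?' and '='
print(accepts(edited_rules, SEED))   # True: no '*', '!' or '@' in the URL
```

With the original rule the seed URL is rejected before it can be fetched; with the edited rule it falls through to the accept rule.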
