Apache Nutch not adding internal links in a web page to fetchlist
Question
I am using Apache Nutch 1.7 and have run into a problem crawling with http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. The page contains many internal links as well as many external links to other domains. I am only interested in the internal links.
However, when this page is crawled, its internal links are not added for fetching in the next round (I have given a depth of 100). I have already set db.ignore.internal.links to false, but for some reason the internal links are still not added to the next fetch list.
On the other hand, if I set db.ignore.external.links to false, it correctly picks up all the external links from the page.
This problem does not occur with any other domain. Can someone tell me what is different about this particular page?
I have also attached the nutch-site.xml that I am using for your review. Please advise.
Recommended answer
Your seed URL is being rejected by the default URL filters, so the page is not being crawled. The URL ends in ?_rdc=1, and the default rule shown below skips any URL containing ? or = as a probable query.
Edit the following files:
conf/automaton-urlfilter.txt
conf/regex-urlfilter.txt
Replace:
# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*
with:
# skip URLs containing certain characters as probable queries, etc.
-.*[*!@].*
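To see why the change matters, here is a minimal sketch that tests the seed URL from the question against both rule bodies. It uses Python's re module purely as a stand-in for the matching Nutch does internally; that substitution is an assumption, and the helper name is_rejected is hypothetical. For simple character classes like these, the behavior should be the same.

```python
import re

# Rule bodies copied from the filter files. The leading "-" in the file
# marks a reject rule and is not part of the regular expression itself.
DEFAULT_SKIP = re.compile(r".*[?*!@=].*")  # original rule
RELAXED_SKIP = re.compile(r".*[*!@].*")    # rule with ? and = removed

# Seed URL from the question.
SEED_URL = "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"


def is_rejected(pattern, url):
    """Hypothetical helper: True if the reject pattern matches the URL."""
    return pattern.search(url) is not None


print("default rule rejects seed:", is_rejected(DEFAULT_SKIP, SEED_URL))
print("relaxed rule rejects seed:", is_rejected(RELAXED_SKIP, SEED_URL))
```

The first check reports a rejection because the URL contains both ? and =, while the relaxed rule lets it through. Keep in mind that relaxing the rule also admits every other query-string URL on the site, so the crawl may grow considerably.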