Nutch 1.2 - Why won't Nutch crawl URLs with query strings?
Question
I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in crawl-urlfilter.txt, so it now looks like this:
# skip urls with these characters
#-[]
#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
So I think I've effectively removed every filter, which should tell Nutch to accept all URLs it finds on my website.
Does anyone have any suggestions? Is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue? Or am I doing something wrong?
Recommended answer
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
You have to comment that rule out, or modify it so that it no longer matches `?` and `=`:
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
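To illustrate why the character class matters, here is a minimal Python sketch of how a first-match regex URL filter might treat a URL with a query string. It is not Nutch's actual code. It assumes that rules are tried in order, that a rule matches when its pattern is found anywhere in the URL, that `-` rejects and `+` accepts, and that a URL matching no rule is rejected. The `url_filter` function, the rule lists, the trailing `+.` accept-all rule, and the `example.com` URL are all hypothetical.

```python
import re

def url_filter(rules, url):
    """Return the URL if accepted, or None if rejected.

    Assumed semantics (not taken from the Nutch source):
    the first rule whose pattern is found anywhere in the URL decides.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return url if sign == "+" else None
    # Assumption: a URL that matches no rule is rejected.
    return None

# Rule as shipped: rejects any URL containing ?, *, !, @ or =
default_rules = [("-", r"[?*!@=]"), ("+", r".")]

# Rule as modified in the answer: ? and = are no longer rejected
relaxed_rules = [("-", r"[*!@]"), ("+", r".")]

# Hypothetical URL with a query string
url = "http://example.com/page?id=1"

print(url_filter(default_rules, url))
print(url_filter(relaxed_rules, url))
```

Under these assumptions, the first call rejects the URL because of the `?` and `=` characters, while the second lets it through to the accept-all rule.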