Nutch 1.2 - Why won't Nutch crawl URLs with query strings?

Question

I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in crawl-urlfilter.txt, so it now looks like this:

# skip urls with these characters
#-[]

#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

So I think I've effectively removed every filter, which should tell Nutch to accept all the URLs it finds on my website.

Does anyone have any suggestions? Is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue? Or am I doing something wrong?

Recommended answer

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

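When this rule is active, it rejects every URL that contains ? or =, so links with query strings are never fetched. You have to comment it out or modify it to: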

# skip URLs containing certain characters as probable queries, etc.
-[*!@]
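To see why the character class matters, here is a minimal Python sketch of first-match-wins URL filtering. It is a simplified model, not Nutch's actual code: the helper names load_rules and accepts, the catch-all `+.` rule, and the example.com URL are all assumptions made for illustration. It assumes that a rule matches when its regex is found anywhere in the URL, that the first matching rule decides, and that a URL matching no rule is rejected.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is rejected


# Rule as shipped, followed by a hypothetical catch-all accept rule.
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+.
"""

# Same file with '?' and '=' removed from the character class.
MODIFIED_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
+.
"""

url = "http://example.com/page?id=1"  # hypothetical URL
print(accepts(load_rules(DEFAULT_RULES), url))   # False
print(accepts(load_rules(MODIFIED_RULES), url))  # True
```

Under these assumptions, the final fallback also means that a file with every rule commented out would accept nothing, so at least one + rule has to remain active somewhere in the filter file.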
