Nutch 1.2 - Why won't Nutch crawl URLs with query strings?

Views: 55  Published: 2021/6/11 18:43:48  Tag: nutch


Question

I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in the crawl-urlfilter.txt file, so it now looks like this:

# skip urls with these characters
#-[]

#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

So I think I've effectively removed every filter, which should tell Nutch to accept all URLs it finds on my website.

Does anyone have any suggestions? Is this a bug in Nutch 1.2? Would upgrading to 1.3 fix the issue, or am I doing something wrong?

Recommended answer

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

You have to comment that rule out, or modify it so that it no longer rejects `?` and `=`:

# skip URLs containing certain characters as probable queries, etc.
-[*!@]
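
To see why this change matters, here is a minimal Python sketch of first-match filtering in the style of Nutch's regex URL filter. It is a simplified model, not Nutch's actual code: the helper names `load_rules` and `accepts`, the example URL, and the trailing `+.` accept-everything rule are assumptions added for illustration. The model assumes that each rule is a `+` or `-` followed by a regex, that the first rule whose regex is found in the URL decides, and that a URL matching no rule is rejected.

```python
import re

def load_rules(text):
    """Parse filter lines into (accept, compiled_regex) pairs.

    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in url.

    Assumption in this sketch: a URL that matches no rule is rejected.
    """
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False

# The rule with '?' and '=' still active (hypothetical trailing '+.' rule).
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+.
"""

# The modified rule from the answer, with '?' and '=' removed.
MODIFIED_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
+.
"""

url = "http://example.com/page?id=1"  # hypothetical URL with a query string
print("default rules accept: ", accepts(load_rules(DEFAULT_RULES), url))
print("modified rules accept:", accepts(load_rules(MODIFIED_RULES), url))
```

Under this model, commenting out every rule does not mean "accept everything": with no matching `+` rule, nothing is accepted, so an explicit accept rule still has to be present.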
