Sites are crawled even when the URL is removed from seed.txt (Nutch 2.1)


Problem description


I performed a successful crawl with url-1 in seed.txt and could see the crawled data in the MySQL database. When I then tried to perform another fresh crawl, replacing url-1 with url-2 in seed.txt, the new crawl started with the fetching step, and the URLs it tried to fetch were the old ones that had been replaced in seed.txt. I am not sure where it picked up the old URL from.


I checked for hidden seed files and didn't find any; there is only one seed file, urls/seed.txt, under NUTCH_HOME/runtime/local, where I run my crawl command. Please advise what the issue might be.

Answer


Your crawl database contains the list of URLs to crawl. Injecting a seed file only adds URLs to that database; it never removes them. Unless you delete the original crawl storage, or start the new crawl against a fresh one, the original list of URLs is reused and merely extended with the new URL.
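A minimal sketch of both options, assuming the Gora/MySQL setup from the question. The table name `webpage` is the Nutch 2.x default storage table, and `nutch`/`nutchdb` are placeholder credentials and database names; adjust all of these to your own configuration:

```shell
# Option 1: wipe the old crawl database, then re-inject the new seed list.
# "webpage" is the default Nutch 2.x storage table; your gora mapping may differ.
mysql -u nutch -p nutchdb -e 'TRUNCATE TABLE webpage;'
bin/nutch inject urls

# Option 2: keep the old data and isolate the new crawl under its own id.
# Nutch 2.x jobs accept a -crawlId flag, which prefixes the storage table
# (e.g. "fresh_webpage"), so the old URL list is never consulted.
bin/nutch inject urls -crawlId fresh
bin/nutch generate -topN 50 -crawlId fresh
```

Option 2 is the less destructive choice when you want to compare crawls, since the original data stays queryable in its own table.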

