Sites are crawled even when the URL is removed from seed.txt (Nutch 2.1)
Problem description
I performed a successful crawl with url-1 in seed.txt and could see the crawled data in the MySQL database. When I then tried to perform a fresh crawl by replacing url-1 with url-2 in seed.txt, the new crawl started with the fetching step, and the URLs it was trying to fetch were those of the old, replaced URL from seed.txt. I am not sure where it picked up the old URL.
I checked for hidden seed files but didn't find any, and there is only urls/seed.txt in NUTCH_HOME/runtime/local, where I run my crawl command. What might be the issue?
Answer
Your crawl database contains the list of URLs to crawl. Unless you delete the original crawl data, or create a new crawl store as part of your new crawl, the original list of URLs will be reused and extended with the new URL.
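In Nutch 2.x the crawl state lives in the Gora-backed store rather than in a local crawldb directory, so with the MySQL backend it persists across runs in a database table. A minimal sketch of resetting that store before a fresh crawl is shown below; the database name `nutchdb`, user `nutch`, and table name `webpage` are assumptions and should be matched against your `conf/gora.properties` and Gora SQL mapping:

```shell
# Wipe the persisted crawl state (table/db names are assumptions --
# check gora.properties for the actual JDBC URL and mapping).
mysql -u nutch -p nutchdb -e "TRUNCATE TABLE webpage;"

# Re-inject the updated seed list so only the new URLs are known,
# then run the crawl again from NUTCH_HOME/runtime/local.
bin/nutch inject urls
bin/nutch crawl urls -depth 2
```

Truncating (or dropping and recreating) the table guarantees the old url-1 entries cannot be picked up by the generate/fetch steps of the new crawl.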