Sites are crawled even when the URL is removed from seed.txt (Nutch 2.1)
Problem description
I performed a successful crawl with url-1 in seed.txt and could see the crawled data in the MySQL database. When I then tried to perform a fresh crawl by replacing url-1 with url-2 in seed.txt, the new crawl started with the fetching step, and the URLs it was trying to fetch were those of the old, replaced URL from seed.txt. I am not sure where it picked up the old URL.
I checked for hidden seed files but didn't find any, and there is only urls/seed.txt in NUTCH_HOME/runtime/local, where I run my crawl command. What might be the issue?
Answer
Your crawl database contains the list of URLs to crawl. Unless you delete the original crawl data, or create a new crawl store as part of your new crawl, the original list of URLs will be reused and extended with the new URL.
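In Nutch 2.x the crawl state lives in the Gora-backed store rather than in a local crawldb directory, so with the MySQL backend it persists across runs in a database table. A minimal sketch of resetting that store before a fresh crawl is shown below; the database name `nutchdb`, user `nutch`, and table name `webpage` are assumptions and should be matched against your `conf/gora.properties` and Gora SQL mapping:

```shell
# Wipe the persisted crawl state (table/db names are assumptions --
# check gora.properties for the actual JDBC URL and mapping).
mysql -u nutch -p nutchdb -e "TRUNCATE TABLE webpage;"

# Re-inject the updated seed list so only the new URLs are known,
# then run the crawl again from NUTCH_HOME/runtime/local.
bin/nutch inject urls
bin/nutch crawl urls -depth 2
```

Truncating (or dropping and recreating) the table guarantees the old url-1 entries cannot be picked up by the generate/fetch steps of the new crawl.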