Using Nutch to crawl a specified URL list

Views: 43 Published: 2021/6/11 18:41:39 Tags: nutch web-crawler

Problem description

I have a list of one million URLs to fetch. I use this list as the Nutch seed list and run Nutch's basic crawl command to fetch them. However, Nutch also fetches URLs that are not on the list, even though I set the crawl parameters to -depth 1 -topN 1000000. Does anyone know how to restrict the crawl to the listed URLs?

Recommended answer

Set the following property in nutch-site.xml. It defaults to true, which is why outlinks are added to the crawldb.

<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>
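
Before starting a long crawl, it may help to confirm that the override is actually present in the configuration file. The following is a minimal sketch in Python, not part of Nutch itself: the helper name `additions_allowed` and the inline `SAMPLE` configuration are hypothetical, and the fallback of true for a missing property is taken from the answer above.

```python
import xml.etree.ElementTree as ET

# Property name taken from the answer above.
PROPERTY_NAME = "db.update.additions.allowed"


def additions_allowed(config_xml: str, default: bool = True) -> bool:
    """Return the value of db.update.additions.allowed in the given XML text.

    Hypothetical helper: falls back to `default` (true, per the answer
    above) when the property does not appear in the configuration.
    """
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == PROPERTY_NAME:
            value = (prop.findtext("value") or "").strip().lower()
            return value == "true"
    return default


# Illustrative stand-in for the contents of nutch-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    # With the override in place, new outlinks should not be added.
    print(additions_allowed(SAMPLE))
```

To check a real installation, read the text of nutch-site.xml from wherever your Nutch configuration directory lives and pass it to the function in place of `SAMPLE`.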
