Apache Nutch doesn't crawl website

Views: 49 · Published: 2021/6/11 18:43:12 · Tags: solr, web-crawler, nutch


Question

I have installed Apache Nutch for web crawling. I want to crawl a website that has the following robots.txt:

User-Agent: *
Disallow: /

Is there any way to crawl this website with Apache Nutch?

Recommended answer

In nutch-site.xml, set protocol.plugin.check.robots to false.
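As a rough sketch, the override could look like the fragment below. It assumes the Hadoop-style property layout that Nutch configuration files use, and it assumes the Nutch release in use still reads protocol.plugin.check.robots; that is worth checking against the release's own nutch-default.xml before relying on it.

```xml
<!-- Goes inside the existing <configuration> element of conf/nutch-site.xml. -->
<!-- Assumption: this Nutch version still honors the property named in the answer. -->
<property>
  <name>protocol.plugin.check.robots</name>
  <value>false</value>
</property>
```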

Or

You can comment out the code that performs the robots check. In Fetcher.java, lines 605-614 perform the check (line numbers may differ between Nutch versions). Comment out that entire block and rebuild Nutch:

      if (!rules.isAllowed(fit.u)) {
        // unblock
        fetchQueues.finishFetchItem(fit, true);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Denied by robots.txt: " + fit.url);
        }
        output(fit.url, fit.datum, null, ProtocolStatus.STATUS_ROBOTS_DENIED, CrawlDatum.STATUS_FETCH_GONE);
        reporter.incrCounter("FetcherStatus", "robots_denied", 1);
        continue;
      }
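To make the effect of that check concrete, here is a minimal, self-contained sketch of a prefix-based robots test with a switch that bypasses it. It is not Nutch's implementation: the class RobotsCheckSketch, the methods parseDisallows and mayFetch, and the checkRobots flag are all hypothetical names, and the parser is simplified to treat every Disallow line as applying to every user agent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names come from Nutch.
final class RobotsCheckSketch {

    // Collects Disallow path prefixes from a robots.txt body.
    // Simplification: User-Agent grouping is ignored, so every
    // Disallow line is treated as applying to all agents.
    static List<String> parseDisallows(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        String key = "disallow:";
        for (String line : robotsTxt.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith(key)) {
                String path = trimmed.substring(key.length()).trim();
                if (!path.isEmpty()) { // an empty Disallow blocks nothing
                    prefixes.add(path);
                }
            }
        }
        return prefixes;
    }

    // Returns true if the path may be fetched. With checkRobots set to
    // false the rules are skipped entirely, which is the effect that
    // disabling or commenting out the check is meant to have.
    static boolean mayFetch(String path, List<String> disallows, boolean checkRobots) {
        if (!checkRobots) {
            return true;
        }
        for (String prefix : disallows) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> rules = parseDisallows("User-Agent: *\nDisallow: /");
        System.out.println(mayFetch("/index.html", rules, true));  // prints false
        System.out.println(mayFetch("/index.html", rules, false)); // prints true
    }
}
```

Because every URL path starts with "/", the single rule Disallow: / matches all of them, which is why the crawl fetches nothing while the check is active.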
