How to bypass robots.txt with Apache Nutch 2.2.1

Question

Can anyone tell me whether Apache Nutch can ignore or bypass robots.txt while crawling? I am using Nutch 2.2.1. I found that RobotRulesParser.java (full path: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java) is responsible for reading and parsing robots.txt. Is there any way to modify this file so that Nutch ignores robots.txt and carries on crawling?

Or is there any other way to achieve the same result?

Recommended answer

First, if you are crawling any external sites, you should respect their robots.txt files. Otherwise you risk having your IP banned or, worse, facing legal action.

If your site is internal and not exposed to the outside world, you should change its robots.txt file to allow your crawler.
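As an illustration of that change, a robots.txt along the following lines would admit one named crawler and keep every other agent out. The agent name `my-internal-nutch` is a placeholder assumed for this example; it would have to match whatever name your crawler announces (in Nutch this is normally set through the `http.agent.name` property in nutch-site.xml).

```
# Hypothetical agent name; replace with the name your crawler sends.
User-agent: my-internal-nutch
# An empty Disallow means nothing is off limits for this agent.
Disallow:

# Every other crawler is still excluded from the whole site.
User-agent: *
Disallow: /
```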

If your site is exposed to the Internet and its data is confidential, you can try the following option instead. In this case modifying the robots.txt file is too risky, because an external crawler could use your crawler's name to crawl the site.

In the Fetcher.java file:

if (!rules.isAllowed(fit.u.toString())) { }

This is the block that is responsible for blocking URLs. You can experiment with this block to resolve your issue.
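One way to experiment with that check without dropping robots.txt handling altogether is to skip it only for hosts you control. The following is a minimal standalone sketch of that idea, not Nutch code: the `RobotsGate` class, its nested `Rules` interface (a stand-in for the rules object whose `isAllowed` method appears in the snippet above), and the host `intranet.example.com` are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class RobotsGate {

    /** Hypothetical stand-in for the parsed robots.txt rules object. */
    public interface Rules {
        boolean isAllowed(String url);
    }

    // Hosts that belong to you, for which the robots.txt check is skipped.
    private final Set<String> ownedHosts;

    public RobotsGate(Set<String> ownedHosts) {
        this.ownedHosts = ownedHosts;
    }

    /** Returns true if the URL may be fetched. */
    public boolean shouldFetch(Rules rules, String url) {
        String host = URI.create(url).getHost();
        if (host != null && ownedHosts.contains(host.toLowerCase(Locale.ROOT))) {
            return true; // our own site: ignore robots.txt
        }
        return rules.isAllowed(url); // any other site: respect robots.txt
    }

    public static void main(String[] args) {
        Rules denyAll = url -> false; // simulates a robots.txt that blocks everything
        Set<String> owned = new HashSet<>(Arrays.asList("intranet.example.com"));
        RobotsGate gate = new RobotsGate(owned);

        // Owned host: fetched even though the rules deny it.
        System.out.println(gate.shouldFetch(denyAll, "http://intranet.example.com/page")); // true
        // Foreign host: the robots.txt decision stands.
        System.out.println(gate.shouldFetch(denyAll, "http://other.example.org/page"));    // false
    }
}
```

In a real Fetcher.java the equivalent change would amount to wrapping the existing `rules.isAllowed(...)` condition in a similar host check, so that external sites continue to be crawled according to their robots.txt.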
