Parsing robots.txt using Java and identifying whether a URL is allowed


Problem Description

I am currently using jsoup in an application to parse and analyze web pages. But I want to make sure that I adhere to the robots.txt rules and only visit the pages that are allowed.

I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I planned to have a function/module that reads the robots.txt of the domain/site and identifies whether the URL I am going to visit is allowed or not.
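To make the intended contract concrete, the planned module could boil down to something as small as the interface below; the name and signature are purely illustrative and not taken from any existing library:

// Hypothetical interface for the planned robots.txt checker (names are illustrative only)
public interface RobotsTxtChecker {
    /** Returns true if the given user agent may fetch the URL according to the site's robots.txt. */
    boolean isAllowed(String url, String userAgent);
}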

I did some research and found the following. But I am not sure about these, so it would be great if someone who has done the same kind of project involving robots.txt parsing could share their thoughts and ideas.

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12

Recommended Answer

A late answer, just in case you - or someone else - are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example from the code I use:

String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// Host identifier (protocol://host[:port]) used as the cache key for this site's rules
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
        + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// Cache of already-parsed robots.txt rules, one entry per host
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    // httpclient is an Apache HttpClient 4.2.x instance, e.g. new DefaultHttpClient()
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // No robots.txt present: allow everything
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        // Parse the robots.txt content with crawler-commons for the given user agent
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);

Obviously this is not related to jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net stuff as well.
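For reference, a minimal sketch of what a java.net-based fetch could look like, keeping the same ALLOW_ALL fallback on 404. The class name RobotsTxtFetcher and method fetchRules are illustrative, not part of the original answer, and the imports assume the crawlercommons.robots package layout:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsTxtFetcher {

    /** Fetches and parses robots.txt for the given host using only java.net. */
    public static BaseRobotRules fetchRules(String hostId, String userAgent) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(hostId + "/robots.txt").openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        if (conn.getResponseCode() == 404) {
            // No robots.txt: treat everything as allowed, as in the HttpClient version
            return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        }
        // Read the response body into a byte array for the parser
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        return new SimpleRobotRulesParser().parseContent(hostId, buffer.toByteArray(),
                "text/plain", userAgent);
    }
}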

Please note that this code only checks for allowance or disallowance and does not consider other robots.txt features like "Crawl-delay". But as crawler-commons provides this feature as well, it can easily be added to the code above.
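If crawl delays should be honoured too, crawler-commons exposes the parsed value on the rules object. A minimal sketch, assuming BaseRobotRules#getCrawlDelay() returns the delay in milliseconds and a non-positive value when none was specified:

// Sketch only: honour "Crawl-delay" before the next request to this host.
// Assumes getCrawlDelay() yields milliseconds and a non-positive value when unset.
long crawlDelay = rules.getCrawlDelay();
if (urlAllowed && crawlDelay > 0) {
    Thread.sleep(crawlDelay); // politeness pause between requests to the same host
}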
