Parsing robots.txt using Java and identifying whether a URL is allowed


Question

I am currently using jsoup in an application to parse and analyse web pages, but I want to make sure that I adhere to the robots.txt rules and only visit pages which are allowed.

I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I planned to have a function/module which reads the robots.txt of the domain/site and identifies whether the URL I am going to visit is allowed or not.

I did some research and found the following, but I am not sure about them, so it would be great if anyone who has done the same kind of project involving robots.txt parsing could share their thoughts and ideas.

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crow/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12

Answer

A late answer just in case you - or someone else - are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example from the code I use:

String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// Build a host identifier like "http://www.example.com:8080" to use as the cache key
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
                + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// Cache of parsed robots.txt rules per host (typically kept as a field so it survives across calls)
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    // Fetch the robots.txt of this host with Apache HttpClient 4.2.x
    HttpClient httpclient = new DefaultHttpClient();
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // No robots.txt found: treat everything as allowed
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        // Parse the robots.txt content with crawler-commons
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);

Obviously this is not related to Jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net as well.
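A rough sketch of such a java.net-based fetch could look like the following (the RobotsTxtFetcher class and fetchRules method names are just placeholders for illustration, and it assumes crawler-commons is still used for the parsing):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsTxtFetcher {

    // Fetches and parses robots.txt for a host id like "http://www.example.com" using only java.net
    public static BaseRobotRules fetchRules(String hostId, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(hostId + "/robots.txt").openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        if (conn.getResponseCode() == 404) {
            // No robots.txt found: treat everything as allowed
            return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        }
        // Read the response body into a byte array
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        InputStream in = conn.getInputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        in.close();
        String contentType = conn.getContentType() != null ? conn.getContentType() : "text/plain";
        return new SimpleRobotRulesParser().parseContent(hostId, buffer.toByteArray(), contentType, userAgent);
    }
}

The rules returned here can then be cached per host and queried with isAllowed(url) exactly as in the snippet above.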

Please note that the code above only checks for allowance or disallowance and does not consider other robots.txt features like "Crawl-delay". But since crawler-commons provides this feature as well, it can easily be added to the code above.
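For example, assuming getCrawlDelay() on the parsed rules returns the delay in milliseconds and BaseRobotRules.UNSET_CRAWL_DELAY is the sentinel for "no delay specified", a minimal sketch could be:

// Respect the Crawl-delay directive, if the site specified one
long crawlDelay = rules.getCrawlDelay();
if (crawlDelay != BaseRobotRules.UNSET_CRAWL_DELAY) {
    try {
        // Wait the requested amount of time between requests to this host
        Thread.sleep(crawlDelay);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}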
