Parsing robots.txt using Java and identifying whether a URL is allowed


Question

I am currently using jsoup in an application to parse and analyse web pages, but I want to make sure that I adhere to the robots.txt rules and only visit pages which are allowed.

I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I planned to have a function/module which reads the robots.txt of the domain/site and identifies whether the URL I am going to visit is allowed or not.

I did some research and found the following, but I am not sure about them, so it would be great if anyone who has done the same kind of project involving robots.txt parsing could share their thoughts and ideas.

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crow/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12

Answer

A late answer just in case you - or someone else - are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example from the code I use:

String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// Build a host identifier like "http://www.example.com:8080" to use as the cache key
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
                + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// Cache of parsed robots.txt rules per host (typically kept as a field so it survives across calls)
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    // Fetch the robots.txt of this host with Apache HttpClient 4.2.x
    HttpClient httpclient = new DefaultHttpClient();
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // No robots.txt found: treat everything as allowed
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        // Parse the robots.txt content with crawler-commons
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);

Obviously this is not related to Jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net as well.
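A rough sketch of such a java.net-based fetch could look like the following (the RobotsTxtFetcher class and fetchRules method names are just placeholders for illustration, and it assumes crawler-commons is still used for the parsing):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsTxtFetcher {

    // Fetches and parses robots.txt for a host id like "http://www.example.com" using only java.net
    public static BaseRobotRules fetchRules(String hostId, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(hostId + "/robots.txt").openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        if (conn.getResponseCode() == 404) {
            // No robots.txt found: treat everything as allowed
            return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        }
        // Read the response body into a byte array
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        InputStream in = conn.getInputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        in.close();
        String contentType = conn.getContentType() != null ? conn.getContentType() : "text/plain";
        return new SimpleRobotRulesParser().parseContent(hostId, buffer.toByteArray(), contentType, userAgent);
    }
}

The rules returned here can then be cached per host and queried with isAllowed(url) exactly as in the snippet above.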

Please note that the code above only checks for allowance or disallowance and does not consider other robots.txt features like "Crawl-delay". But since crawler-commons provides this feature as well, it can easily be added to the code above.
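For example, assuming getCrawlDelay() on the parsed rules returns the delay in milliseconds and BaseRobotRules.UNSET_CRAWL_DELAY is the sentinel for "no delay specified", a minimal sketch could be:

// Respect the Crawl-delay directive, if the site specified one
long crawlDelay = rules.getCrawlDelay();
if (crawlDelay != BaseRobotRules.UNSET_CRAWL_DELAY) {
    try {
        // Wait the requested amount of time between requests to this host
        Thread.sleep(crawlDelay);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}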
