Parsing robots.txt using Java and identifying whether a URL is allowed


Problem Description

I am currently using jsoup in an application to parse and analyze web pages. But I want to make sure that I adhere to the robots.txt rules and only visit the pages that are allowed.

I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I planned to have a function/module that reads the robots.txt of the domain/site and identifies whether the URL I am going to visit is allowed or not.
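To make the intended contract concrete, the planned module could boil down to something as small as the interface below; the name and signature are purely illustrative and not taken from any existing library:

// Hypothetical interface for the planned robots.txt checker (names are illustrative only)
public interface RobotsTxtChecker {
    /** Returns true if the given user agent may fetch the URL according to the site's robots.txt. */
    boolean isAllowed(String url, String userAgent);
}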

I did some research and found the following. But I am not sure about these, so it would be great if someone who has done the same kind of project involving robots.txt parsing could share their thoughts and ideas.

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12

Recommended Answer

A late answer, just in case you - or someone else - are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example from the code I use:

String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// Host identifier (protocol://host[:port]) used as the cache key for this site's rules
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
        + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// Cache of already-parsed robots.txt rules, one entry per host
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    // httpclient is an Apache HttpClient 4.2.x instance, e.g. new DefaultHttpClient()
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // No robots.txt present: allow everything
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        // Parse the robots.txt content with crawler-commons for the given user agent
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);

Obviously this is not related to jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net stuff as well.
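For reference, a minimal sketch of what a java.net-based fetch could look like, keeping the same ALLOW_ALL fallback on 404. The class name RobotsTxtFetcher and method fetchRules are illustrative, not part of the original answer, and the imports assume the crawlercommons.robots package layout:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsTxtFetcher {

    /** Fetches and parses robots.txt for the given host using only java.net. */
    public static BaseRobotRules fetchRules(String hostId, String userAgent) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(hostId + "/robots.txt").openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        if (conn.getResponseCode() == 404) {
            // No robots.txt: treat everything as allowed, as in the HttpClient version
            return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        }
        // Read the response body into a byte array for the parser
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        return new SimpleRobotRulesParser().parseContent(hostId, buffer.toByteArray(),
                "text/plain", userAgent);
    }
}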

Please note that this code only checks for allowance or disallowance and does not consider other robots.txt features like "Crawl-delay". But as crawler-commons provides this feature as well, it can easily be added to the code above.
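If crawl delays should be honoured too, crawler-commons exposes the parsed value on the rules object. A minimal sketch, assuming BaseRobotRules#getCrawlDelay() returns the delay in milliseconds and a non-positive value when none was specified:

// Sketch only: honour "Crawl-delay" before the next request to this host.
// Assumes getCrawlDelay() yields milliseconds and a non-positive value when unset.
long crawlDelay = rules.getCrawlDelay();
if (urlAllowed && crawlDelay > 0) {
    Thread.sleep(crawlDelay); // politeness pause between requests to the same host
}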
