Getting redirected URL in Apache HttpComponents


Question

I'm using Apache HttpComponents to GET web pages for a set of crawled URLs. Many of those URLs redirect to different URLs (e.g. because they have been processed with a URL shortener). In addition to downloading the content, I would like to resolve the final URL (i.e. the URL that actually served the downloaded content) or, even better, all URLs in the redirect chain.

I have been looking through the API docs but could not find a place to hook in. Any hints would be greatly appreciated.

Recommended answer

Here's a full demo of how to do it using Apache HttpComponents.

You need to extend DefaultRedirectStrategy like this:

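import java.net.URI;
import java.util.Deque;
import java.util.LinkedList;

import org.apache.http.HttpRequest;
import org.apache.http.HttpResponse;
import org.apache.http.ProtocolException;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.protocol.HttpContext;

// Records every URI the client visits while following redirects.
class SpyStrategy extends DefaultRedirectStrategy {
    // push() adds at the head, so the most recent (final) URI comes first.
    public final Deque<URI> history = new LinkedList<>();

    public SpyStrategy(URI uri) {
        history.push(uri); // the originally requested URI
    }

    @Override
    public HttpUriRequest getRedirect(
            HttpRequest request,
            HttpResponse response,
            HttpContext context) throws ProtocolException {
        HttpUriRequest redirect = super.getRedirect(request, response, context);
        history.push(redirect.getURI());
        return redirect;
    }
}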
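Because SpyStrategy records URIs with push(), the deque holds the chain newest-first: peek() yields the final URL and the originally requested URL sits at the tail. The following is a minimal, JDK-only sketch of that ordering and of one way to read the chain in visiting order. The class name HistoryOrderDemo, the helper chronological, and the example.com URIs are illustrative assumptions, not part of the original answer.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class HistoryOrderDemo {
    // Returns the recorded URIs in the order they were visited (oldest first).
    static List<URI> chronological(Deque<URI> history) {
        List<URI> ordered = new ArrayList<>();
        // descendingIterator() walks from the tail (first pushed) to the head.
        Iterator<URI> it = history.descendingIterator();
        while (it.hasNext()) {
            ordered.add(it.next());
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical redirect chain, pushed in the order a client would see it.
        Deque<URI> history = new LinkedList<>();
        history.push(URI.create("http://short.example.com/abc"));  // original request
        history.push(URI.create("http://www.example.com/mid"));    // first redirect
        history.push(URI.create("https://www.example.com/final")); // last redirect

        System.out.println("Final URL: " + history.peek());
        System.out.println("Chain:     " + chronological(history));
    }
}
```

If only the final URL matters, history.peek() on the returned deque is enough; the helper is only needed when the whole chain should be reported in the order it was followed.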

The expand method sends a HEAD request. Because the client follows redirects automatically, it collects every URI along the way in the spy.history deque:

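import java.io.IOException;
import java.net.URI;
import java.util.Deque;

import org.apache.http.client.methods.HttpHead;
import org.apache.http.impl.client.DefaultHttpClient;

public static Deque<URI> expand(String uri) {
    HttpHead head = new HttpHead(uri);
    SpyStrategy spy = new SpyStrategy(head.getURI());
    DefaultHttpClient client = new DefaultHttpClient();
    client.setRedirectStrategy(spy);
    try {
        // FIXME: the following completely ignores HTTP errors:
        client.execute(head);
        return spy.history;
    } catch (IOException e) {
        throw new RuntimeException(e);
    } finally {
        // Release the client's connections once the chain has been recorded.
        client.getConnectionManager().shutdown();
    }
}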

You may want to set the maximum number of redirects followed to something reasonable (instead of the default of 100), like so:

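import org.apache.http.client.params.ClientPNames;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.BasicHttpParams;

BasicHttpParams params = new BasicHttpParams();
params.setIntParameter(ClientPNames.MAX_REDIRECTS, 5);
DefaultHttpClient client = new DefaultHttpClient(params);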
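To see why a cap matters, here is a small JDK-only simulation of following a redirect chain with a hop limit. It is not how HttpClient is implemented; the redirect table, the class name RedirectLimitDemo, and the follow method are hypothetical stand-ins that only illustrate the idea that a redirect loop never terminates unless the number of hops is bounded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RedirectLimitDemo {
    // Follows entries in a hypothetical redirect table (source -> target) and
    // fails as soon as another hop would exceed maxRedirects.
    static List<String> follow(Map<String, String> redirects, String start, int maxRedirects) {
        List<String> chain = new ArrayList<>();
        chain.add(start);
        String current = start;
        int hops = 0;
        while (redirects.containsKey(current)) {
            if (hops >= maxRedirects) {
                throw new IllegalStateException(
                        "Maximum redirects (" + maxRedirects + ") exceeded");
            }
            current = redirects.get(current);
            chain.add(current);
            hops++;
        }
        return chain;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("/a", "/b");
        redirects.put("/b", "/c");
        // Two hops, well under the limit of 5.
        System.out.println(follow(redirects, "/a", 5));

        // Closing the loop (/c -> /a) would spin forever without a limit.
        redirects.put("/c", "/a");
        try {
            follow(redirects, "/a", 5);
        } catch (IllegalStateException e) {
            System.out.println("Stopped: " + e.getMessage());
        }
    }
}
```

A small limit such as 5 is usually plenty for URL shorteners, which rarely chain more than a few hops.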
