Passing Jsoup a URL that redirects to a URL containing spaces leads to an error. How can this be resolved?


Question

Hello, I have to parse pages whose URI is resolved by a server redirect.

Example:

I have http://www.juventus.com/wps/poc?uri=wcm:oid:91da6dbb-4089-49c0-a1df-3a56671b7020 which redirects to http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/primavera%20convocati%20villar%20news%2010agosto2013?pragma=no-cache

This is the URI of the page that I have to parse. The problem is that the redirect URI contains spaces. Here's the code:

    String url = "http://www.juventus.com/wps/poc?uri=wcm:oid:91da6dbb-4089-49c0-a1df-3a56671b7020";
    Document doc = Jsoup.connect(url).get();

    Element img = doc.select(".juveShareImage").first();
    String imgurl = img.absUrl("src");
    System.out.println(imgurl);

I get this error at the second line:

    Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/primavera convocati villar news 10agosto2013?pragma=no-cache

The message contains the redirected URL, so Jsoup does receive the correct redirect URI. Is there a way to replace each ' ' with %20 so that I can parse the page without a problem?

Thanks!

Recommended answer

You are right, this is the problem. The only solution I see is to follow the redirects manually. I wrote this small recursive method that does this for you:

    import java.io.IOException;
    import java.net.HttpURLConnection;

    import org.jsoup.Connection.Response;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class ManualRedirect
    {
        public static void main(String[] args) throws IOException
        {
            String url = "http://www.juventus.com/wps/poc?uri=wcm:oid:91da6dbb-4089-49c0-a1df-3a56671b7020";

            Document document = manualRedirectHandler(url);

            Elements elements = document.getElementsByClass("juveShareImage");

            for (Element element : elements)
            {
                System.out.println(element.attr("src"));
            }
        }

        // Follows redirects by hand so that spaces in the Location header
        // can be encoded as %20 before the next request is sent.
        private static Document manualRedirectHandler(String url) throws IOException
        {
            Response response = Jsoup.connect(url.replace(" ", "%20")).followRedirects(false).execute();
            int status = response.statusCode();

            if (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM || status == HttpURLConnection.HTTP_SEE_OTHER)
            {
                String redirectUrl = response.header("location");
                System.out.println("Redirect to: " + redirectUrl);
                return manualRedirectHandler(redirectUrl);
            }

            // No redirect: parse the body; parse() keeps the final URL as the base URI.
            return response.parse();
        }
    }

This prints:

    Redirect to: http://www.juventus.com:80/wps/portal/!ut/p/b0/DcdJDoAgEATAF00GXFC8-QqVWwMuJLLEGP2-1q3Y8Mwm4Qk77pATzv_L6-KQgx-09FDeWmpEr6nRThCk36hGq1QnbScqwRMbNuXCHsFLyuTgjpVLjOMHyfCBUg!!/
    Redirect to: http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/primavera convocati villar news 10agosto2013?pragma=no-cache
    /resources/images/news/inlined/42d386ef-1443-488d-8f3e-583b1e5eef61.jpg

I also added a patch for Jsoup for that.
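As a side note on the manual redirect handling above: a server may also send a relative Location header, which the recursive method would pass to Jsoup unchanged. The following is only a minimal sketch of one way to normalize the redirect target before the next request. The class `RedirectTarget` and its `normalize` method are hypothetical names introduced here, not part of Jsoup, and the sketch assumes that spaces are the only illegal characters in the header (anything else illegal would make `java.net.URI` throw an `IllegalArgumentException`). The `example.com` URL is a placeholder.

```java
import java.net.URI;

// Hypothetical helper, not part of Jsoup: prepares a Location header value
// for the next request in a manual redirect loop.
final class RedirectTarget {

    // Encodes literal spaces as %20 and resolves a possibly relative
    // Location value against the URL that produced the redirect.
    static String normalize(String currentUrl, String location) {
        // Only spaces are handled; existing escapes such as %20 are left alone.
        String encodedLocation = location.trim().replace(" ", "%20");
        URI base = URI.create(currentUrl.replace(" ", "%20"));
        // resolve() returns the location itself when it is already absolute.
        return base.resolve(encodedLocation).toString();
    }

    public static void main(String[] args) {
        String start = "http://www.juventus.com/wps/poc?uri=wcm:oid:91da6dbb-4089-49c0-a1df-3a56671b7020";
        String location = "http://www.juventus.com/wps/wcm/connect/JUVECOM-IT/news/"
                + "primavera convocati villar news 10agosto2013?pragma=no-cache";

        // Absolute Location containing spaces, as in the question.
        System.out.println(normalize(start, location));

        // Relative Location (placeholder URL), resolved against the current URL.
        System.out.println(normalize("http://example.com/a/b?x=1", "/news/my page"));
    }
}
```

In the recursive method, the call could then become `manualRedirectHandler(RedirectTarget.normalize(url, redirectUrl))`, assuming the helper is on the classpath. Capping the number of hops would also guard against redirect loops, which the recursion does not do on its own.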
