JSoup with userAgent prevents redirects


Question

I am using JSoup for my web crawler:

Connection con = Jsoup.connect("http://t.co/uySIPVNfgP");
Document doc = con.get();
String u = doc.baseUri();

The code above returns the redirected URL as the base URI.

But when a user agent is set as follows:

con.userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");

With this user agent set, JSoup does not follow the redirect.

As far as I know, some websites do not allow their content to be crawled when no user agent is sent.

How can I solve this?

Recommended answer

It seems that http://t.co/uySIPVNfgP does not respond with a server-side redirect when the user agent is set. Instead, it sends the redirect inside the HTML page as a meta refresh tag.

With jsoup I was able to extract the redirect URL as follows:

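import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Returns the meta refresh target of the page, or null if there is none.
// url is the address that was passed to Jsoup.connect().
static String metaRedirectUrl(Document doc, String url) {
    Elements redirEle = doc.head().select("meta[http-equiv=refresh]");
    if (redirEle.isEmpty()) {
        return null;
    }
    String content = redirEle.get(0).attr("content");
    Pattern pattern = Pattern.compile("^.*URL=(.+)$", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(content);
    if (!matcher.matches()) {
        return null;
    }
    String redirectUrl = matcher.group(1);
    if (redirectUrl.startsWith("'")) {
        // Strip the single quotes around a quoted URL.
        redirectUrl = redirectUrl.replaceAll("(^')|('$)", "");
    }
    if (redirectUrl.startsWith("/")) {
        // Root-relative path: prepend the scheme and host of the original URL.
        String[] splitUrl = url.split("/");
        redirectUrl = splitUrl[0] + "//" + splitUrl[2] + redirectUrl;
    }
    return redirectUrl;
}

Document doc = con.get();
String redirectUrl = metaRedirectUrl(doc, "http://t.co/uySIPVNfgP");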
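
The manual split on "/" above only covers root-relative targets. As a hedged alternative, the sketch below resolves the target with the JDK's java.net.URI.resolve, which also handles absolute and path-relative targets. The class name MetaRefresh, the method resolve, the regular expression, and the example.com URLs are assumptions made for illustration; they are not part of the original answer or of jsoup. The content string is the value you would read from the meta tag's content attribute.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRefresh {
    // Matches the "url=" part of a meta refresh content value, e.g. "0;URL=/next".
    // Optional single or double quotes around the target are skipped.
    private static final Pattern URL_PART =
            Pattern.compile("url\\s*=\\s*['\"]?([^'\"]+)['\"]?\\s*$", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute redirect target, or null when content has no URL part. */
    public static String resolve(String baseUrl, String content) {
        Matcher m = URL_PART.matcher(content);
        if (!m.find()) {
            return null;
        }
        // URI.resolve handles absolute, root-relative, and path-relative targets.
        return URI.create(baseUrl).resolve(m.group(1).trim()).toString();
    }

    public static void main(String[] args) {
        // Hypothetical inputs, standing in for the page URL and the meta tag content.
        System.out.println(resolve("http://example.com/a/b", "0;URL='/next'"));
        System.out.println(resolve("http://example.com/a/b", "0; url=https://other.example/x"));
    }
}
```

Passing the URL given to Jsoup.connect() as baseUrl keeps the behavior close to the snippet above. Note that URI.create throws IllegalArgumentException on malformed input, so a crawler would probably want to catch that.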
