JSoup + Link extraction + redirect URL

Problem description

My code works in most cases, but it fails when the site redirects to a new URL. For example, http://www.oil-india.com/ redirects to http://www.oil-india.com/oilnew/ in the browser. With JSoup, the code below fails to retrieve the links from the original URL.

doc = Jsoup.connect(url).timeout(0).userAgent(USER_AGENT).validateTLSCertificates(false).followRedirects(true).get();

Elements subLinks = doc.select("a[href]");

Recommended answer

If you print out the document, you will notice that the redirect is done using JavaScript rather than an HTTP redirect, which is why followRedirects(true) has no effect here:

[...]
window.location.href = '../oilnew/'; 
[...]

You could parse the script tag manually and, on finding window.location.href, check whether it is triggered on load and extract the target. Alternatively, use HtmlUnit (though it is quite slow) to follow the redirects.
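The manual option can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the class name JsRedirect, the method findTarget, and the regular expression are assumptions made for illustration, and the code does not check whether the assignment actually runs on page load. It only looks for the first window.location / location.href assignment in the script text and resolves the target against the page URL.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsRedirect {

    // Hypothetical pattern: matches assignments such as
    //   window.location.href = '../oilnew/';
    //   location.href = "page.html"
    //   window.location = '/other/'
    private static final Pattern REDIRECT = Pattern.compile(
            "(?:window\\.)?location(?:\\.href)?\\s*=\\s*(['\"])(.*?)\\1");

    /**
     * Returns the absolute redirect target found in scriptText,
     * resolved against baseUrl, or Optional.empty() if none is found.
     */
    public static Optional<String> findTarget(String scriptText, String baseUrl) {
        Matcher m = REDIRECT.matcher(scriptText);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            URI resolved = URI.create(baseUrl).resolve(m.group(2).trim());
            String path = resolved.getPath();
            if (path == null) {
                // Opaque URI (for example "mailto:..."): return it unchanged.
                return Optional.of(resolved.toString());
            }
            // java.net.URI keeps leading ".." segments it cannot remove,
            // whereas browsers drop them, so strip them here.
            while (path.startsWith("/../")) {
                path = path.substring(3);
            }
            URI cleaned = new URI(resolved.getScheme(), resolved.getAuthority(),
                    path, resolved.getQuery(), resolved.getFragment());
            return Optional.of(cleaned.toString());
        } catch (IllegalArgumentException | URISyntaxException e) {
            // Target or base URL could not be parsed as a URI.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String script = "window.location.href = '../oilnew/';";
        System.out.println(findTarget(script, "http://www.oil-india.com/"));
    }
}
```

With jsoup, the script text could be collected from the script elements of the parsed document (for example via doc.select("script") and each element's data()) and passed in together with the original URL. If a target is found, it can be fetched with a second Jsoup.connect call.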

Example code

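import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;

String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36";
String url = "http://www.oil-india.com/";

// Silence HtmlUnit's verbose logging
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

// try-with-resources closes the WebClient once the final URL is known
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setRedirectEnabled(true);

    // HtmlUnit executes the page's JavaScript, so getUrl() should be the URL after the redirect
    url = webClient.getPage(url).getUrl().toString(); // HtmlUnit
    // jsoup then fetches and parses the redirected page
    Document doc = Jsoup.connect(url).userAgent(userAgent).followRedirects(true).get(); // jsoup
    System.out.println(doc.toString());
} catch (FailingHttpStatusCodeException | IOException e) {
    e.printStackTrace();
}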

Output

<a href="#" class="close">Close</a>
<a href="default.aspx"><img src="oilindia-img/logo.jpg" alt="Oil India" style="height:95px;"></a>
 <a href="screenreader.aspx"><img src="oilindia-img/screen_reader_icon.png" style="vertical-align:middle;" alt="top"><span id="MenuBarTop_link_screenreader" class="link_screenreader">Screen Reader Access</span> </a>
<a href="javascript:decreaseFontSize();" class="toplink"> <img alt="orange color" src="oilindia-img/a-.png" id="Img1"> </a>
[...]
