JSoup + Link extraction + redirect URL
Problem description
My code works in most cases, but it fails when the site redirects to a new URL. For example, in the browser http://www.oil-india.com/ redirects to http://www.oil-india.com/oilnew/. With JSoup, the code below fails to retrieve the links from the original URL:
doc = Jsoup.connect(url).timeout(0).userAgent(USER_AGENT).validateTLSCertificates(false).followRedirects(true).get();
Elements subLinks = doc.select("a[href]");
Recommended answer
If you print out the document, you will notice that the redirect is done using JavaScript rather than an HTTP redirect. jsoup does not execute JavaScript, so followRedirects(true) has no effect here:
[...]
window.location.href = '../oilnew/';
[...]
You could parse the script tag manually: when you find window.location.href, check whether it is triggered on load and extract the target. Alternatively, use HtmlUnit (though it is quite slow) to follow the redirect.
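The manual approach can be sketched roughly as follows. This is a minimal illustration, not a robust JavaScript parser: the class name `JsRedirectSketch`, the method `findRedirectTarget`, and the regular expression are all assumptions introduced here. It only recognizes a plain assignment of a quoted string to `location` or `location.href`, and it does not check whether the assignment actually runs on load. In practice the script text would come from the fetched page, for example via jsoup's `doc.select("script")` and `Element.data()`; here it is passed in as a string so the sketch depends only on the JDK.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsRedirectSketch {
    // Hypothetical heuristic: matches window.location.href = '...', location.href = "...",
    // or window.location = '...'. Comparisons such as == are not matched.
    private static final Pattern LOCATION_ASSIGNMENT = Pattern.compile(
            "\\b(?:window\\.)?location(?:\\.href)?\\s*=\\s*(['\"])(.*?)\\1");

    // Returns the absolute redirect target found in the script source, if any.
    static Optional<String> findRedirectTarget(String scriptSource, String baseUrl) {
        Matcher m = LOCATION_ASSIGNMENT.matcher(scriptSource);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            // Resolve the (possibly relative) target against the page URL.
            String resolved = URI.create(baseUrl).resolve(m.group(2)).toString();
            // java.net.URI may keep ".." segments that climb above the root
            // ("/../oilnew/"); browsers drop them, so strip them here as well.
            resolved = resolved.replaceFirst(
                    "^([a-zA-Z][a-zA-Z0-9+.-]*://[^/]+)(?:/\\.\\.)+(?=/)", "$1");
            return Optional.of(resolved);
        } catch (IllegalArgumentException e) {
            // The target was not a syntactically valid URI.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String script = "window.location.href = '../oilnew/';";
        findRedirectTarget(script, "http://www.oil-india.com/")
                .ifPresent(System.out::println);
    }
}
```

The resolved URL can then be passed to Jsoup.connect(...) in place of the original one. A regular expression is used here only because the redirect in this page is a single literal assignment; scripts that build the URL dynamically would need a real JavaScript engine such as HtmlUnit.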
Example code
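import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// HtmlUnit 2.x package names; HtmlUnit 3.x uses org.htmlunit instead.
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;

String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36";
String url = "http://www.oil-india.com/";
Document doc;
// Silence HtmlUnit's verbose logging.
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setRedirectEnabled(true);
try {
    // HtmlUnit executes the page's JavaScript, so getUrl() is the URL after the redirect.
    url = webClient.getPage(url).getUrl().toString();
    // jsoup then fetches and parses the redirected page.
    doc = Jsoup.connect(url).userAgent(userAgent).followRedirects(true).get();
    System.out.println(doc.toString());
} catch (FailingHttpStatusCodeException | IOException e) {
    e.printStackTrace();
}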
Output
<a href="#" class="close">Close</a>
<a href="default.aspx"><img src="oilindia-img/logo.jpg" alt="Oil India" style="height:95px;"></a>
<a href="screenreader.aspx"><img src="oilindia-img/screen_reader_icon.png" style="vertical-align:middle;" alt="top"><span id="MenuBarTop_link_screenreader" class="link_screenreader">Screen Reader Access</span> </a>
<a href="javascript:decreaseFontSize();" class="toplink"> <img alt="orange color" src="oilindia-img/a-.png" id="Img1"> </a>
[...]