Http URLConnection wait for inner request
Question
I am working on a crawling project. When I open a simple URLConnection to the website, as shown below:
URLConnection conn = new URL(url).openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
The method returns the HTML body correctly. However, the website makes inner requests for some fields. For example, it fetches the total number of users from a different web service. In a web browser, the total number of users appears after some time, but the URLConnection approach does not wait for it, and the returned HTML does not contain that field.
In Java, is there any way to make URLConnection wait for a while so that it fetches all the data from the website?
Recommended answer
From your "inner requests" comment, it sounds like the website is using JavaScript (via a framework or just native browser APIs) to fetch data and render the results into the DOM. This is very common nowadays with SPAs and similar designs.
If that's the case, no amount of waiting will change the outcome when using a simple HTTP library like URLConnection. You can check this by saving the HTML locally and viewing it in your browser: what happens? When you examine it, is there JavaScript on that page?
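The check described above can also be done from Java. The following is only a minimal sketch: the class name ScriptCheck, the output file name page.html, the example.com URL, and the assumption that the page is UTF-8 encoded are all illustrative and not part of the original question. It saves the raw response to disk for viewing in a browser and counts the script tags in it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class ScriptCheck {
    // Matches the start of an opening <script> tag, case-insensitively.
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);

    // Counts opening <script> tags in the given HTML text.
    static int countScriptTags(String html) {
        Matcher m = SCRIPT_TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Fetches the raw response body, exactly as URLConnection sees it.
    // Assumption: the page is encoded as UTF-8.
    static String fetch(String url) throws IOException {
        URLConnection conn = new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass the real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        String html = fetch(url);

        // Save the HTML so it can be opened in a browser for comparison.
        Path out = Paths.get("page.html");
        Files.write(out, html.getBytes(StandardCharsets.UTF_8));

        System.out.println("Saved " + out.toAbsolutePath());
        System.out.println("<script> tags found: " + countScriptTags(html));
    }
}
```

A non-zero count alongside a missing field is a strong hint, though not proof, that the field is filled in by JavaScript after the initial page load.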
To do this properly in code, you'll need something that behaves more like a browser and can execute the JS referenced by the HTML in a DOM-like environment. Try Selenium with PhantomJS or headless Chrome / Firefox, or maybe GhostDriver.
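As a rough illustration of the headless-browser idea, the sketch below does not use Selenium. To stay within the Java standard library, it launches headless Chrome as an external process with its --dump-dom flag, which prints the DOM after scripts have run. The class name HeadlessDump, the binary name google-chrome, the 5000 ms virtual time budget, and the 30-second exit timeout are assumptions for illustration; the binary name in particular varies by platform and installation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class, named for illustration only.
class HeadlessDump {
    // Builds the command line for a headless Chrome run that prints the
    // rendered DOM to standard output.
    static List<String> buildCommand(String chromeBinary, String url, int budgetMillis) {
        return List.of(
                chromeBinary,
                "--headless",
                "--disable-gpu",
                // Gives page scripts a time budget to run before the DOM is dumped.
                "--virtual-time-budget=" + budgetMillis,
                "--dump-dom",
                url);
    }

    // Runs Chrome and returns whatever it wrote to standard output.
    static String renderedHtml(String chromeBinary, String url, int budgetMillis)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(chromeBinary, url, budgetMillis))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop browser log noise
                .start();
        String html;
        try (InputStream in = p.getInputStream()) {
            html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        // Assumed upper bound on how long Chrome may take to exit.
        if (!p.waitFor(30, TimeUnit.SECONDS)) {
            p.destroyForcibly();
            throw new IOException("Chrome did not exit in time");
        }
        return html;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; pass the real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        // Assumption: Chrome is installed and on the PATH as "google-chrome".
        System.out.println(renderedHtml("google-chrome", url, 5000));
    }
}
```

Selenium with a headless browser, as suggested above, gives finer control, such as waiting until a specific element appears, at the cost of an extra dependency and a matching browser driver.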