How to make mechanize wait for a web page's 'full' load?
Problem description
I want to scrape a web page that loads its components dynamically. The page has an onload script, and I can see the complete page 3-5 seconds after typing the URL into my browser.
The problem is that when I call br.open('URL'), the response is the web page as it is at 0 seconds. The HTML I want only exists 3-5 seconds later, so it differs from the result of br.open('URL').
Recommended answer
Working with a JavaScript-heavy web page in mechanize is not easy, but there are ways to get what you want, depending on the situation.
If the content is built from JSON requests, you can call those URLs yourself, parse the responses to get the content, and then join the pieces together properly.
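As a rough illustration of the JSON approach, the sketch below calls a list of endpoints and joins the lists they return. The helper names (fetch_json, collect_items), the "items" key, and the shape of the payload are assumptions made for this example; the real endpoints and field names have to be found by watching the page's network traffic. It uses the standard library's urlopen by default, but any callable that returns an object with a read() method should work in its place.

```python
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen):
    """Fetch `url` with `opener` and decode the response body as JSON.

    `opener` is any callable that takes a URL and returns an object
    with a read() method (hypothetical hook for swapping in a browser).
    """
    body = opener(url).read()
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)


def collect_items(urls, opener=urlopen, key="items"):
    """Call each JSON endpoint and join the lists stored under `key`.

    The "items" key is an assumption; adjust it to the real payload.
    """
    items = []
    for url in urls:
        payload = fetch_json(url, opener)
        items.extend(payload.get(key, []))
    return items
```

With a mechanize browser br, passing opener=br.open should behave the same way, since br.open returns a response object that can be read.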
If you need to use forms, you can create the form fields and set their values within mechanize. Or simply write a method that encodes your POST or GET data (quoting special characters and so on) and send it with the mechanize.Browser.open method.
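The encoding method mentioned above could look something like the following sketch. build_request is a hypothetical helper written for this example, and the URL and field names in the comment are placeholders; the encoding itself is done by urlencode from the standard library.

```python
from urllib.parse import urlencode


def build_request(url, fields, method="GET"):
    """Return a (url, data) pair ready to hand to an opener.

    GET:  the encoded fields are appended to the query string, data is None.
    POST: the URL is unchanged and the encoded fields become the body bytes.
    """
    encoded = urlencode(fields)  # percent-encodes special characters
    if method.upper() == "POST":
        return url, encoded.encode("ascii")
    separator = "&" if "?" in url else "?"
    return url + separator + encoded, None


# Possible use with mechanize (placeholder URL and fields):
#   br = mechanize.Browser()
#   response = br.open(*build_request("http://example.com/search",
#                                     {"q": "a b&c"}, method="POST"))
```

Because open accepts the URL followed by optional data, the returned pair can be unpacked straight into the call.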
If the page has JavaScript-based security functions (such as a special encoding applied to form data before it is posted), you can use a JavaScript runtime such as node.js to run the relevant blocks of JavaScript code.
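One way to do this from Python, assuming Node.js is installed and on the PATH, is to hand the JavaScript to the node executable and read back what it prints. eval_js is a hypothetical helper for this example; in a real case the expression would be built from the encoding function copied out of the page's own scripts.

```python
import shutil
import subprocess


def eval_js(expression, node_path=None, timeout=10):
    """Evaluate a JavaScript expression with Node.js and return its printed result.

    Uses `node -p`, which evaluates the expression and prints the value.
    Raises RuntimeError if no node executable can be found.
    """
    node = node_path or shutil.which("node")
    if node is None:
        raise RuntimeError("Node.js executable not found on PATH")
    completed = subprocess.run(
        [node, "-p", expression],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if node exits with an error
    )
    return completed.stdout.strip()
```

The returned string can then be placed in the form data before it is sent with mechanize.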
In practice, though, some of the options above are not easy to carry out, so think twice before using mechanize for this kind of project.