Scraping a web page with JavaScript in Python
Question
I'm working in Python 3.2 (newbie) on a Windows machine (I also have Ubuntu 10.04 in VirtualBox if needed, but I prefer to work on the Windows machine).
Basically, I'm able to use the http and urllib modules to scrape web pages, but only pages that don't use JavaScript such as document.write("<div....") and the like to add data that is not there when I fetch the actual page (meaning without real Ajax scripts).
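To make the problem concrete, here is a small sketch using only the standard library. The page content, the `TagCounter` class, and the `late` id are made up for illustration; the point is just that a parser which does not execute scripts never sees the markup that `document.write` would add.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> only exists after a browser runs the script.
RAW_HTML = """<html><body>
<p>static text</p>
<script>document.write("<div id='late'>added by script</div>");</script>
</body></html>"""


class TagCounter(HTMLParser):
    """Counts start tags seen by a static (non-executing) HTML parser."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(RAW_HTML)
counter.close()

print(counter.counts.get("p", 0))    # 1: the static paragraph is found
print(counter.counts.get("div", 0))  # 0: script text is not parsed as markup
```

The same thing happens with HTML downloaded through urllib: the script source arrives as plain text, and the elements it would create are missing until something actually runs the JavaScript.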
To process those kinds of sites as well, I'm pretty sure I need a browser-style JavaScript processor that runs on the page and gives me the final result as output, hopefully as a dict or text.
I tried to compile python-spidermonkey, but I understand that it's not for Windows and that it doesn't work with Python 3.x :-?
Any suggestions? If anyone has done something like this before, I'd appreciate the help!
Recommended answer
I recommend Python's bindings to the WebKit library; here is an example. WebKit is cross-platform and is used to render web pages in Chrome and Safari. An excellent library.
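As a rough sketch of what this could look like: the answer does not say which binding it means, so the code below assumes PyQt4 with its QtWebKit module is installed. The function name `render_html` is made up for illustration. The page is loaded in a `QWebPage`, the code waits for `loadFinished`, and then the resulting HTML is read back.

```python
import sys


def render_html(url):
    """Load url in a WebKit page and return the HTML after the load finished.

    Assumption: PyQt4 (with QtWebKit) is installed. It is imported lazily
    so that this module can still be imported without it.
    """
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    # Reuse an existing QApplication if there is one; Qt allows only one.
    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebPage()
    # loadFinished is emitted when the page load completes; stop the loop then.
    page.loadFinished.connect(lambda ok: app.quit())
    page.mainFrame().load(QUrl(url))
    app.exec_()  # blocks until the lambda above calls app.quit()
    return str(page.mainFrame().toHtml())
```

Calling `render_html` with a page URL should return markup that includes elements written by inline scripts such as `document.write`. Content fetched by scripts after the load has finished may still be missing, so a real scraper would probably need extra waiting logic.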