Scraping a web page with JavaScript in Python


Question

I'm working in Python 3.2 (newbie) on a Windows machine (though I have Ubuntu 10.04 in VirtualBox if needed, I'd prefer to work on the Windows machine).

Basically, I'm able to use the http and urllib modules to scrape web pages, but only those that don't use JavaScript such as document.write("<div....") and the like to add data that isn't there in the page as I actually fetch it (meaning without real Ajax scripts).
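To make the problem concrete, here is a minimal sketch (the page markup is made up for illustration) showing why a plain fetch-and-parse approach misses script-generated content: Python's stdlib `html.parser` treats everything inside a `<script>` element as raw text, so a `<div>` written by `document.write` is never seen as an element.

```python
from html.parser import HTMLParser

# Hand-written sample page: one div present in the static HTML,
# one div that only exists after a browser executes the script.
PAGE = """<html><body>
<div id="static">visible to urllib</div>
<script>document.write('<div id="dynamic">added by JavaScript</div>');</script>
</body></html>"""

class DivCollector(HTMLParser):
    """Record the id attribute of every <div> start tag the parser sees."""
    def __init__(self):
        super().__init__()
        self.div_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_ids.append(dict(attrs).get("id"))

parser = DivCollector()
parser.feed(PAGE)
print(parser.div_ids)  # only the static div appears: ['static']
```

The "dynamic" div never shows up because the parser, like urllib, only sees the page source, not the DOM a browser would build after running the script.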

To process those kinds of sites as well, I'm pretty sure I need a browser's JavaScript engine to process the page and give me an output with the final result, hopefully as a dict or text.

I tried to compile python-spidermonkey, but as I understand it, it isn't available for Windows and doesn't work with Python 3.x :-?

Any suggestions? If anyone has done something like this before, I'd appreciate the help!

Answer

I recommend Python's bindings to the WebKit library - here is an example. WebKit is cross-platform and is used to render web pages in Chrome and Safari. An excellent library.
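The pattern the answer points at is usually sketched with PyQt4's QtWebKit bindings: let the embedded WebKit engine load the URL and execute its JavaScript, then read the resulting DOM back as HTML. The class and method names below are from the PyQt4 QtWebKit API, and the sketch is guarded so it only activates where PyQt4 is actually installed; treat it as an illustration of the approach, not a drop-in implementation.

```python
import sys

try:
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    class Render(QWebPage):
        """Load a URL, let WebKit run its JavaScript, then capture the DOM."""
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._finished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()  # blocks until _finished() quits the event loop

        def _finished(self, ok):
            # toHtml() returns the post-JavaScript DOM, not the raw source
            self.html = self.mainFrame().toHtml()
            self.app.quit()

    HAVE_QTWEBKIT = True
except ImportError:
    # PyQt4/QtWebKit not installed; the class above is illustrative only.
    HAVE_QTWEBKIT = False
```

Usage (where the bindings are available) would be `html = Render("http://example.com").html`, after which the captured HTML can be fed to any ordinary parser, since the script-generated elements are now present in it.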

