Scrape HTML generated by JavaScript with Python

Question

I need to scrape a site with Python. I fetch the HTML source with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). On the site, pressing a button makes this function output some HTML. How can I "press" this button from Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it in the URL I get a 403 error. Any suggestions?

Recommended answer

In Python, I think Selenium 1.0 is the way to go. It's a library that lets you control a real web browser from the language of your choice.

You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
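
As a rough illustration of the idea, here is a minimal sketch of "pressing" a button from Python and reading the HTML the page generates afterwards. It assumes the current Selenium WebDriver bindings (the `selenium` package) rather than the Selenium 1.0 API named above, and it assumes Firefox and its driver are installed. The function name `fetch_html_after_click` and both CSS selector parameters are hypothetical and would need to match the real page.

```python
def fetch_html_after_click(url, button_selector, result_selector, timeout=10):
    """Open `url` in a real browser, click a button, and return the rendered HTML.

    `button_selector` and `result_selector` are hypothetical CSS selectors for
    the button to press and for an element the page's JavaScript creates.
    """
    # Imported lazily so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Wait until the button can be clicked, then click it like a user would.
        wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, button_selector))
        ).click()
        # Wait until the JavaScript-generated element appears in the DOM.
        wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, result_selector))
        )
        # page_source now reflects the DOM after the script has run.
        return driver.page_source
    finally:
        # Always close the browser, even if a wait times out.
        driver.quit()
```

A call might look like `fetch_html_after_click("https://example.com/page", "#load-button", "#results")`, where the URL and selectors are placeholders. The returned string can then be parsed with whatever HTML parser the rest of the scraper already uses.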
