Scraping websites with JavaScript enabled?


Problem description

I'm trying to scrape and submit information to websites that rely heavily on JavaScript to do most of their actions. These sites won't even work when I disable JavaScript in my browser.

I've searched for solutions on Google and SO, and someone suggested that I should reverse engineer the JavaScript, but I have no idea how to do that.

So far I've been using Mechanize, and it works on websites that don't require JavaScript.

Is there any way to access websites that use JavaScript with urllib2 or something similar? I'm also willing to learn JavaScript, if that's what it takes.

Recommended answer

I wrote a small tutorial on this subject; it might help:

http://koaning.io/dynamic-scraping-with-python.html

Basically, you have the Selenium library pretend to be a Firefox browser; the browser waits until all the JavaScript has loaded before it hands you the HTML string. Once you have that string, you can parse it with BeautifulSoup.
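
For reference, here is a minimal sketch of that workflow, assuming the selenium and beautifulsoup4 packages are installed and Firefox with geckodriver is available; the URL is just a placeholder:

    # Minimal sketch: drive a real Firefox via Selenium, then parse the
    # rendered HTML with BeautifulSoup. Assumes Firefox + geckodriver are
    # installed; http://example.com is a placeholder URL.
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Firefox()               # launches an actual Firefox instance
    try:
        driver.get("http://example.com")       # JavaScript runs just as in a normal browser
        html = driver.page_source              # HTML after the page's scripts have executed
    finally:
        driver.quit()                          # always close the browser

    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.string if soup.title else "no <title> found")

Since the question also involves submitting information: Selenium can fill in and submit forms through the same driver (locating elements and sending keystrokes or clicks) before you read page_source, which Mechanize cannot do on JavaScript-driven pages.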

