Web scraping a website with dynamic javascript content


Problem description


So I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab a page's HTML, I don't get the entire page, because some of it is generated via JavaScript. Is there any way to get around this?

Answer


There are basically two main options to proceed with:


  • Use your browser's developer tools to see which AJAX requests are loading the page, and simulate them in your script; you will probably need the json module to load the response JSON string into a Python data structure.
  • Use a tool like Selenium that opens up a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS (http://www.realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/).


The first option is more difficult to implement and, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
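The first approach can be sketched roughly as follows. The endpoint URL and response shape below are hypothetical placeholders; in practice you would copy the actual request URL from the Network tab of your browser's developer tools:

```python
import json
import urllib.request

def fetch_json(url):
    """Fetch a URL and decode its body as JSON (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# The json module turns the response body into plain Python structures,
# so there is no HTML to parse at all. Sample body for illustration:
body = '{"items": [{"title": "First post"}, {"title": "Second post"}]}'
data = json.loads(body)
titles = [item["title"] for item in data["items"]]
```

Once the response is a dict or list, you extract fields by key instead of hunting for tags with BeautifulSoup.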


The second option is better in the sense that you get exactly what any other real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page - you may not need BeautifulSoup at all. But, either way, this option is slower than the first one.
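A minimal sketch of the second approach with a current Selenium release (note the linked article uses PhantomJS, which has since been deprecated in favour of headless Chrome/Firefox). The URL and the element looked up are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run the browser without a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    # Elements generated by JavaScript are present in the rendered DOM:
    heading = driver.find_element(By.TAG_NAME, "h1")
    print(heading.text)
    # driver.page_source holds the full rendered HTML if you still
    # prefer to hand it to BeautifulSoup.
finally:
    driver.quit()
```

This requires a browser and a matching driver to be installed, which is part of why it is slower and heavier than simulating the AJAX calls directly.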

Hope that helps.
