Web scraping a website with dynamic javascript content

Problem description

So I'm using Python and beautifulsoup4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab the HTML of a page, it's not the entire page, because some of it is generated via JavaScript. Is there any way to get around this?

Recommended answer

There are basically two main options to proceed with:

  • Using the browser's developer tools, see what AJAX requests are made to load the page and simulate them in your script; you will probably need the json module to load the response JSON string into a Python data structure.
  • Use a tool like selenium that opens up a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS.

The first option is more difficult to implement and it's, generally speaking, more fragile, but it doesn't require a real browser and can be faster.
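
For the first option, here is a rough sketch. The endpoint URL, the headers, and the JSON keys are placeholders; the real ones come from whatever request you see in the browser's Network tab for the page in question:

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the request URL you find in the
# browser's Network tab (such requests usually return JSON, not HTML).
url = "https://example.com/api/items?page=1"

# Some sites only answer requests that look like they came from a browser,
# so mimic the headers the real AJAX request used.
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
})

with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# `data` is now an ordinary Python structure (dict/list) you can walk directly;
# the "items" and "title" keys below are only an example of what it might hold.
for item in data.get("items", []):
    print(item.get("title"))
```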

The second option is better in the sense that you get exactly what any other real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page - you may not need BeautifulSoup at all. But, either way, this option is slower than the first one.
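
For the second option, a minimal sketch using Selenium with headless Chrome (the linked article does the same thing with PhantomJS, but the idea is identical); the page URL and the CSS selector are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome without a visible window.
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.implicitly_wait(10)  # give JavaScript-rendered elements time to appear

try:
    # Hypothetical URL and selector: replace with the page and elements you need.
    driver.get("https://example.com/page-with-js")

    # driver.page_source now contains the HTML *after* JavaScript ran, so you
    # could hand it to BeautifulSoup, or simply use Selenium's own locators:
    for element in driver.find_elements(By.CSS_SELECTOR, ".item-title"):
        print(element.text)
finally:
    driver.quit()
```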

Hope that helps.
