Web scraping a website with dynamic JavaScript content

Question

So I'm using Python and BeautifulSoup 4 (which I'm not tied to) to scrape a website. The problem is that when I use urllib to grab a page, I don't get the entire HTML, because some of it is generated via JavaScript. Is there any way to get around this?
Answer

There are basically two main options to proceed with:

- Using your browser's developer tools, see which AJAX requests are made to load the page and simulate them in your script. You will probably need the json module to load the response JSON string into a Python data structure.
- Use a tool like Selenium that opens up a real browser. The browser can also be "headless"; see Headless Selenium Testing with Python and PhantomJS (http://www.realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/).
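The first option can be sketched with nothing more than the standard library. The endpoint URL, the X-Requested-With header, and the "items"/"title" keys below are hypothetical stand-ins for whatever you actually find in the browser's Network tab:

```python
import json
import urllib.request

def fetch_json(url):
    """Fetch an AJAX endpoint and decode its JSON body into Python objects."""
    # The X-Requested-With header mimics what many sites' own JavaScript sends.
    req = urllib.request.Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def extract_titles(payload):
    """Pull the interesting fields out of the decoded response."""
    return [item["title"] for item in payload.get("items", [])]

# Usage (hypothetical endpoint discovered via the developer tools):
#   data = fetch_json("https://example.com/api/items?page=1")
#   print(extract_titles(data))
```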
The first option is more difficult to implement and, generally speaking, more fragile, but it doesn't require a real browser and can be faster.

The second option is better in the sense that you get exactly what any other real user gets, and you don't have to worry about how the page was loaded. Selenium is pretty powerful at locating elements on a page - you may not need BeautifulSoup at all. But, either way, this option is slower than the first one.
Hope that helps.