Reading dynamically generated web pages using Python
Problem description
I am trying to scrape a web site using Python and Beautiful Soup. On some sites, image links that are visible in the browser do not appear in the page source. However, when I use Chrome Inspect or Fiddler, I can see the corresponding code. What I see in the source is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated inside this div. Is there a way to load the generated content in Python as well? I am using the regular urllib in Python, and I am able to get the source, but without the generated part.
I am not a web developer, so I cannot describe the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
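You need a JavaScript engine to parse and run the JavaScript code inside the page. urllib only downloads the raw HTML and never executes scripts, which is why the div comes back empty. There are a number of headless browsers that can help you: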
http://code.google.com/p/spynner/
http://github.com/ryanpetrello/python-zombie
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
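As a rough illustration of the headless-browser approach, here is a minimal sketch. It uses Selenium with headless Chrome, which is a swapped-in choice and not one of the tools linked above, and it assumes Selenium and a Chrome installation are available. The names `ImageLinkParser`, `extract_image_links`, and `render_page`, and the example URL, are hypothetical; only the container id `cntnt` comes from the question. The link extraction uses the standard library's `html.parser` so it runs without extra packages; Beautiful Soup could be used on the rendered HTML in the same way.

```python
from html.parser import HTMLParser


class ImageLinkParser(HTMLParser):
    """Collect the src of every <img> nested inside the element with a given id."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0  # 0 means we are outside the container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag == "img" and attrs.get("src"):
                self.links.append(attrs["src"])
            if tag not in self.VOID_TAGS:
                self.depth += 1
        elif attrs.get("id") == self.container_id and tag not in self.VOID_TAGS:
            self.depth = 1  # entered the container

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID_TAGS:
            self.depth -= 1


def extract_image_links(html, container_id="cntnt"):
    """Return the image URLs found inside the container, in document order."""
    parser = ImageLinkParser(container_id)
    parser.feed(html)
    parser.close()
    return parser.links


def render_page(url, container_id="cntnt", timeout=10):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    # Assumption: a recent Chrome; older versions expect "--headless" instead.
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until a script has put at least one element into the container.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, f"#{container_id} *"))
        )
        return driver.page_source
    finally:
        driver.quit()


if __name__ == "__main__":
    # The static source from the question contains no images at all.
    print(extract_image_links('<div id="cntnt"></div>'))
    # With a browser available (hypothetical URL):
    # print(extract_image_links(render_page("https://example.com/")))
```

The split into two functions is deliberate: `render_page` is the only part that needs a browser, so the parsing step can be checked against saved HTML first. If the wait times out, the content is probably loaded some other way (for example inside an iframe), and the selector would need adjusting.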