Web scraping dynamic content with Python


Problem description


I'd like to use Python to scrape the contents of the "Were you looking for these authors:" box on web pages like this one: http://academic.research.microsoft.com/Search?query=lander

Unfortunately, the contents of the box are loaded dynamically by JavaScript. Usually in this situation I can read the JavaScript to figure out what's going on, or I can use a browser extension like Firebug to figure out where the dynamic content is coming from. No such luck this time: the JavaScript is pretty convoluted, and Firebug doesn't give many clues about how to get at the content.

Are there any tricks that will make this task easy?

Solution

Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.

If you run the following query in the Chrome console, you'll see that it returns everything you want.

document.getElementsByClassName('inline-text-org');

It returns:

[<div class="inline-text-org" title="University of Manchester">University of Manchester</div>,
 <div class="inline-text-org" title="University of California Irvine">University of California ...</div>,
 etc...

Using ghost.py, you can run that same JavaScript from Python against a real DOM.

This is really cool:

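# Older ghost.py releases expose open() and evaluate() directly on Ghost;
# newer releases may require going through a session object instead.
from ghost import Ghost

ghost = Ghost()
# Load the page and let its JavaScript run.
page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')
# Run the same query as in the console, inside the rendered page.
result, resources = ghost.evaluate(
    "document.getElementsByClassName('inline-text-org');")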
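If it turns out to be easier to get the rendered HTML out as a string than to work with the evaluated result, the names can also be pulled out on the Python side. Here is a minimal sketch using only the standard library. It assumes you already have the rendered markup as a string; how you obtain it depends on your ghost.py version. The names `OrgTitleParser`, `extract_org_titles`, and `sample_html` are made up for illustration, and the sample markup just mirrors the console output shown above.

```python
from html.parser import HTMLParser


class OrgTitleParser(HTMLParser):
    """Collect the title attribute of every element with class 'inline-text-org'."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # An element may carry several classes, so split before checking.
        classes = (attr_map.get("class") or "").split()
        if "inline-text-org" in classes and attr_map.get("title"):
            self.titles.append(attr_map["title"])


def extract_org_titles(html):
    """Return the title of each 'inline-text-org' element, in document order."""
    parser = OrgTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical sample that mirrors the console output; not fetched from the site.
sample_html = (
    '<div class="inline-text-org" title="University of Manchester">'
    'University of Manchester</div>'
    '<div class="inline-text-org" title="University of California Irvine">'
    'University of California ...</div>'
)

print(extract_org_titles(sample_html))
```

The sketch reads the title attribute rather than the element text because, in the output above, the visible text is truncated ("University of California ...") while the title holds the full name.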
