使用Python进行屏幕抓取 [英] Screen scraping with Python

查看:755
本文介绍了使用Python进行屏幕抓取的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Python是否具有提供JavaScript支持的屏幕抓取库?

Does Python have screen scraping libraries that offer JavaScript support?

我一直在将 pycurl 用于简单的HTML请求,并将Java的 HtmlUnit 用于需要JavaScript支持的更复杂的请求.

I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests requiring JavaScript support.

理想情况下,我希望能够使用Python进行所有操作,但是我没有遇到任何允许我执行此操作的库.它们存在吗?

Ideally I would like to be able to do everything from Python, but I haven't come across any libraries that would allow me to do it. Do they exist?

推荐答案

在处理静态HTML时,有很多选项,其他响应也涵盖了这些选项.但是,如果您需要JavaScript支持并希望继续使用Python,建议您使用 webkit 呈现网页(包括JavaScript),然后检查结果HTML.例如:

There are many options when dealing with static HTML, which the other responses cover. However if you need JavaScript support and want to stay in Python I recommend using webkit to render the webpage (including the JavaScript) and then examine the resulting HTML. For example:

import sys
import signal
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.html = None
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.connect(self, SIGNAL('loadFinished(bool)'), self._finished_loading)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _finished_loading(self, result):
        self.html = self.mainFrame().toHtml()
        self.app.quit()


if __name__ == '__main__':
    try:
        url = sys.argv[1]
    except IndexError:
        print 'Usage: %s url' % sys.argv[0]
    else:
        javascript_html = Render(url).html

这篇关于使用Python进行屏幕抓取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆