How to parse JavaScript-generated [dynamic] content on a web page [HTML] using Python?

Question

I am building a spider and I am using Beautiful Soup to parse the content of a particular URL. Some sites use JavaScript to show dynamic content, which appears only after some event [a click or a timer] occurs. Beautiful Soup parses only the static content, that is, the page as it is before the JavaScript has run. I want the content as it is after the JavaScript has run. Is there any way to do this?

I can think of one way: grab the URL, open it in a browser so that its JavaScript runs, and then pass the resulting page to Beautiful Soup, which can then see the content that the JavaScript [dynamic content] has produced. However, if I am crawling millions of links, this solution is not practical. Is there a built-in module that can generate the dynamic content of an HTML page beforehand?

Recommended answer

Your best bet for accurately parsing JavaScript-enhanced content from web pages is to load the page via a browser engine. Luckily, there are ways to automate this in Python.
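To illustrate why a browser engine is needed, here is a minimal sketch using only the Python 3 standard library. The page markup, `InputCollector`, and `static_inputs` are hypothetical names made up for this illustration; the point is only that a parser which never executes scripts cannot see elements that a script would create.

```python
from html.parser import HTMLParser

# Hypothetical page: the submit button is created by a script and is
# not present in the static markup.
PAGE = """<html><body>
<form id="search"><input type="text" name="q"></form>
<script>
  var b = document.createElement('input');
  b.type = 'submit';
  b.name = 'btnG';
  document.getElementById('search').appendChild(b);
</script>
</body></html>"""


class InputCollector(HTMLParser):
    """Collects the attributes of every <input> start tag in the markup."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.inputs.append(dict(attrs))


def static_inputs(html):
    """Return the <input> elements visible to a parser that never runs scripts."""
    collector = InputCollector()
    collector.feed(html)
    collector.close()
    return collector.inputs


if __name__ == "__main__":
    # Only the text field is found; the script-created submit button is not.
    print(static_inputs(PAGE))
```

The script body is treated as plain text by the parser, so the submit button only exists once something actually executes the JavaScript.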

The method I've had the most success with is the pywebkitgtk project, which lets you programmatically create and control instances of the WebKit browser engine from within a Python application. I also use the jswebkit module to simplify executing JavaScript in the page context.

Another option is PyQt4's QtWebKit module, which I've only used for experimentation.

Here is a working example that uses pywebkitgtk and jswebkit together to extract data from a WebKit-rendered page. In a production environment you'll want to run several of these processors in parallel, each rendering to its own instance of the X virtual framebuffer (Xvfb).
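The parallel setup mentioned above could be sketched roughly as follows. This is not the answerer's production code: `render_worker.py`, the display numbers, the worker count, and all function names are assumptions made for illustration. The only external detail relied on is the usual `Xvfb :N -screen 0 WxHxD` command line and the `DISPLAY` environment variable.

```python
import os
import subprocess


def split_round_robin(urls, workers):
    """Deal the URLs out to `workers` buckets, one at a time."""
    buckets = [[] for _ in range(workers)]
    for i, url in enumerate(urls):
        buckets[i % workers].append(url)
    return buckets


def xvfb_command(display):
    """Command line for one virtual X server on display number `display`."""
    return ["Xvfb", ":%d" % display, "-screen", "0", "1024x768x24"]


def worker_env(display, base_env=None):
    """Copy of the environment with DISPLAY pointing at the worker's own Xvfb."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = ":%d" % display
    return env


def launch(urls, workers=4, first_display=99, script="render_worker.py"):
    """Start one Xvfb and one renderer process per non-empty bucket of URLs.

    `render_worker.py` is a hypothetical script that renders the URLs passed
    on its command line. A real crawler would feed URLs through a queue or a
    file, and would wait for each display to come up before starting a worker.
    """
    procs = []
    for n, bucket in enumerate(split_round_robin(urls, workers)):
        if not bucket:
            continue
        display = first_display + n
        xvfb = subprocess.Popen(xvfb_command(display))
        worker = subprocess.Popen(["python", script] + bucket,
                                  env=worker_env(display))
        procs.append((xvfb, worker))
    return procs


def wait_all(procs):
    """Wait for every worker to finish, then shut down its Xvfb server."""
    for xvfb, worker in procs:
        worker.wait()
        xvfb.terminate()
```

Giving each worker its own display keeps the renderers isolated from one another, at the cost of one extra Xvfb process per worker.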

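# Note: Python 2 syntax (PyGTK 2.x bindings)
import pygtk
pygtk.require20()  # select PyGTK 2.x before gtk is imported

import gtk
import jswebkit
import lxml.html
import webkit

def load_finished(view, frame):
    # called when a frame finishes loading; ignore everything but the main frame
    if frame != view.get_main_frame():
        return
    # wrap the frame's JavaScript context so scripts can be evaluated from Python
    ctx = jswebkit.JSContext(frame.get_global_context())
    res = ctx.EvaluateScript('window.location.href')
    print res
    # fetch the DOM as it stands after JavaScript has run, then parse it with lxml
    res = ctx.EvaluateScript('document.body.innerHTML')
    tree = lxml.html.fromstring(res)
    print tree.xpath('//input[@type="submit"]')

# initialization
gtk.gdk.threads_init()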

# create the webview and hook up callbacks to signals
view = webkit.WebView()
view.set_size_request(1024, 768)
view.connect('load-finished', load_finished)

# configure the webview
props = view.get_settings()
props.set_property('enable-java-applet', False)
props.set_property('enable-plugins', False)
props.set_property('enable-page-cache', False)

# create a window to host the webview
win = gtk.Window()
win.add(view)
win.show_all()

# open google front page
view.open('http://www.google.com')

# spin, processing gtk events
while True:
    try:
        while gtk.events_pending():
            gtk.main_iteration(False)
    except KeyboardInterrupt:
        break

Example output:

http://www.google.com/
[<InputElement 2a64a78 name='btnG' type='submit'>, <InputElement 2a64bb0 name='btnG' type='submit'>, <InputElement 2a64ae0 name='btnI' type='submit'>]
