How to parse HTML that includes JavaScript code

Question

How does one parse HTML documents that make heavy use of JavaScript? I know there are a few libraries in Python that can parse static XML/HTML files, and I'm basically looking for a program or library (or even a Firefox plugin) that reads HTML + JavaScript, executes the JavaScript, and outputs HTML without JavaScript, so that it would look identical if displayed in a browser.

As a simple example,

<a href="javascript:web_link(34, true);">link</a>

should be replaced by the appropriate value the JavaScript function returns, e.g.

<a href="http://www.example.com">link</a>
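To illustrate why a static parser alone does not solve this, here is a minimal sketch using only Python's standard-library `html.parser`. The names `HrefCollector` and `collect_hrefs` are made up for this illustration. A static parser reports the `href` exactly as written, so the `javascript:` URL comes back unevaluated:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, exactly as written."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_hrefs(html):
    """Return all <a href="..."> values found in the given HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


snippet = '<a href="javascript:web_link(34, true);">link</a>'

# Links that a static parse cannot resolve without running JavaScript.
unresolved = [h for h in collect_hrefs(snippet) if h.lower().startswith("javascript:")]
print(unresolved)  # ['javascript:web_link(34, true);']
```

Getting from that string to `http://www.example.com` requires something that actually runs `web_link`, which is what the question is asking for.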

A more complex example would be a saved Facebook HTML page, which is littered with loads of JavaScript code.

Probably related to 'How to "execute" HTML+Javascript page with Node.js', but do I really need Node.js and JSDOM? Also slightly related is 'Python library for rendering HTML and javascript', but I'm not interested in rendering just the pure HTML output.

Recommended answer

You can use Selenium with Python, as detailed here.

Example:

import os
import xmlrpc.client  # Python 3 name of the old xmlrpclib module

# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpc.client.ServerProxy(server_url)

# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)

# Launch the browser ("start" and the .bat file are Windows-specific)
os.system('start run_firefox.bat')

# Drive the browser through the Selenium server and print each result
print(app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/'))
print(app.verifyTitle('Amazon.com: Welcome'))
print(app.verifySelected('url', 'All Products'))
print(app.select('url', 'Books'))
print(app.verifySelected('url', 'Books'))
print(app.verifyValue('field-keywords', ''))
print(app.type('field-keywords', 'Python Cookbook'))
print(app.clickAndWait('Go'))
print(app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook'))
print(app.verifyTextPresent('Python Cookbook', ''))
print(app.verifyTextPresent('Alex Martellibot, David Ascher', ''))
print(app.testComplete())
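The example above drives Selenium through its old XML-RPC interface and tests a page rather than dumping its HTML. As a rough sketch of how the same idea might look with the current Selenium WebDriver bindings (not part of the original answer), the code below loads a page, lets the browser run its JavaScript, removes the `<script>` elements, and returns the resulting DOM as HTML. The helper names `render_without_scripts` and `make_headless_firefox` and the `STRIP_SCRIPTS_JS` constant are assumptions made for this illustration. It assumes the `selenium` package and Firefox with geckodriver are installed.

```python
# JavaScript executed in the page after it has loaded: drop every <script>
# element, then serialize the live DOM back to HTML.
STRIP_SCRIPTS_JS = """
document.querySelectorAll('script').forEach(function (s) { s.remove(); });
return document.documentElement.outerHTML;
"""


def render_without_scripts(driver, url):
    """Load url in the browser, then return the DOM as HTML without <script> tags.

    `driver` is any object with Selenium's WebDriver `get` and
    `execute_script` methods.
    """
    driver.get(url)  # the browser runs the page's JavaScript while loading
    return driver.execute_script(STRIP_SCRIPTS_JS)


def make_headless_firefox():
    """Create a headless Firefox WebDriver (requires selenium and geckodriver)."""
    from selenium import webdriver  # imported here so the module loads without selenium

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)
```

A typical use would be `driver = make_headless_firefox()`, then `html = render_without_scripts(driver, "http://www.example.com")`, and finally `driver.quit()`. Note that this only removes `<script>` elements. Inline handlers and `javascript:` URLs such as the `web_link(34, true)` link from the question stay in the markup, and content loaded asynchronously after the page's load event may need an explicit wait before the DOM is serialized.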
