How to save a "complete webpage", not just basic HTML, using Python


Question

I am using the following code to save a webpage using Python:

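import urllib.request  # Python 3; in Python 2 this was urllib.urlretrieve

url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
# urlretrieve downloads only the HTML document at this URL,
# not the scripts, images or stylesheets it links to.
urllib.request.urlretrieve(url, 'test.html')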

Problem: This code saves only the basic HTML, without JavaScript, images, etc. I want to save the webpage as a complete page (like the "Save Page As" option in a browser).

Update: I am now using the following code, hoping to save all the JS/image/CSS files of the webpage so that it is saved as a complete webpage, but my output is still saved as basic HTML:

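import pycurl
from io import BytesIO  # Python 3 replacement for the StringIO module

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html")

b = BytesIO()  # pycurl delivers the response body as bytes
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)  # follow HTTP redirects
c.setopt(pycurl.MAXREDIRS, 5)       # but at most five of them
c.perform()
c.close()
html = b.getvalue()
# print(html)
# Binary mode, because html is a bytes object.
with open("file.html", "wb") as fh:
    fh.write(html)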

Recommended answer

Try emulating your browser with selenium. This script will pop up the "Save As" dialog for the webpage. You will still have to figure out how to emulate pressing Enter to start the download, because the file dialog is out of selenium's reach (how you do that is also OS dependent).

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

br = webdriver.Firefox()
br.get('http://www.google.com/')

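# Send Ctrl+S to open the browser's "Save As" dialog.
# The parentheses let the chained calls span several lines.
save_me = (ActionChains(br)
           .key_down(Keys.CONTROL)
           .send_keys('s')
           .key_up(Keys.CONTROL))
save_me.perform()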

Also, I think following @Amber's suggestion of grabbing the linked resources may be simpler, and thus a better solution. Still, I think selenium is a good starting point, since br.page_source gives you the entire DOM, including the dynamic content generated by JavaScript.
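The "grab the linked resources" idea could be sketched roughly as follows, using only the Python 3 standard library. This is an illustration under assumptions, not the code @Amber or the answerer had in mind: the names ResourceCollector, find_resource_urls and save_complete_page are hypothetical, only img, script and link tags are considered, and the saved HTML is not rewritten to point at the local copies.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen, urlretrieve

# Tag name -> attribute that points at an external resource (assumed subset).
RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}


class ResourceCollector(HTMLParser):
    """Collects absolute URLs referenced by <img>, <script> and <link> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        value = dict(attrs).get(wanted)
        if value:
            # Resolve relative references against the page URL.
            absolute = urljoin(self.base_url, value)
            if absolute not in self.urls:
                self.urls.append(absolute)


def find_resource_urls(html, base_url):
    """Return the de-duplicated resource URLs found in html, in document order."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.urls


def save_complete_page(url, out_dir):
    """Hypothetical helper: save the HTML plus every resource it links to."""
    os.makedirs(out_dir, exist_ok=True)
    html = urlopen(url).read().decode("utf-8", errors="replace")
    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as fh:
        fh.write(html)
    for i, resource in enumerate(find_resource_urls(html, url)):
        # Prefix with an index so files with the same basename do not collide.
        name = os.path.basename(urlparse(resource).path) or "resource"
        urlretrieve(resource, os.path.join(out_dir, f"{i}_{name}"))
```

Calling save_complete_page(url, 'saved_page') would fetch the page and its resources over the network. To include content generated by JavaScript, the string from br.page_source could be passed to find_resource_urls instead of the raw download.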
