How to save a "complete webpage", not just basic HTML, using Python
Problem description
I am using the following code to save a webpage using Python:
import urllib
import sys
from bs4 import BeautifulSoup
url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
f = urllib.urlretrieve(url,'test.html')
Problem: This code saves only the basic HTML, without the JavaScript, images, etc. I want to save the webpage as a complete page (like the "Save Page As" option in a browser).
Update: I am now using the following code to try to save all the JS/image/CSS files of the webpage so that it is saved as a complete webpage, but my output is still saved as basic HTML:
import pycurl
import StringIO
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html")
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
html = b.getvalue()
#print html
with open("file.html", "w") as fh:
    fh.write(html)
Recommended answer
Try emulating your browser with Selenium. This script will pop up the "Save As" dialog for the webpage. You will still have to figure out how to emulate pressing Enter to start the download, because the file dialog is out of Selenium's reach (how you do that is also OS-dependent).
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
br = webdriver.Firefox()
br.get('http://www.google.com/')
# Build a Ctrl+S key chord; the parentheses let the chained calls span lines.
save_me = (ActionChains(br)
           .key_down(Keys.CONTROL)
           .key_down('s')
           .key_up(Keys.CONTROL)
           .key_up('s'))
save_me.perform()
Also, I think following @Amber's suggestion of grabbing the linked resources may be simpler, and thus a better solution. Still, I think Selenium is a good starting point, as br.page_source will get you the entire DOM along with the dynamic content generated by JavaScript.
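The idea of grabbing the linked resources could look roughly like the sketch below. It is only an illustration, not the answerer's code: it assumes Python 3 and the standard library, and the names ResourceCollector, find_resource_urls and save_complete_page are made up for this example. It collects the URLs referenced by img, script and link tags, resolves them against the page URL, and downloads each one next to the saved HTML.

```python
# A sketch only: assumes Python 3 and uses just the standard library.
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen, urlretrieve

# (tag -> attribute) pairs that usually point at external resources.
# Note: <link href> also matches non-stylesheet links (icons, canonical URLs).
RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}


class ResourceCollector(HTMLParser):
    """Hypothetical helper: collects absolute URLs of linked resources."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.urls:
                    self.urls.append(absolute)


def find_resource_urls(html, base_url):
    """Return the de-duplicated resource URLs in document order."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.urls


def save_complete_page(url, out_dir):
    """Hypothetical helper: save the HTML plus every resource found in it."""
    os.makedirs(out_dir, exist_ok=True)
    html = urlopen(url).read().decode("utf-8", errors="replace")
    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as fh:
        fh.write(html)
    for resource in find_resource_urls(html, url):
        # Files with the same base name overwrite each other in this sketch.
        name = os.path.basename(urlparse(resource).path) or "resource"
        urlretrieve(resource, os.path.join(out_dir, name))
```

To include JavaScript-generated content, the string from br.page_source could be passed to find_resource_urls instead of the raw download. This sketch does not rewrite the references inside the saved HTML to point at the local copies, and it ignores resources referenced from inside CSS files, so the result is still less complete than a browser's "Save As".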