How to save "complete webpage" not just basic html using Python
I am using the following code to save a webpage using Python:
import urllib
import sys
from bs4 import BeautifulSoup
url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
f = urllib.urlretrieve(url,'test.html')
Problem: this code saves the html as basic html only, without the javascript, images, etc. I want to save the webpage as complete (like the "save complete webpage" option we have in a browser).
Update: I am now using the following code to save all the js/images/css files of the webpage so that it can be saved as a complete webpage, but my output html is still getting saved as basic html:
import pycurl
import StringIO
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html")
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
html = b.getvalue()
#print html
fh = open("file.html", "w")
fh.write(html)
fh.close()
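As an aside, the snippet above is Python 2 (`StringIO`, `pycurl`). In Python 3 the same fetch can be done with the stdlib alone, since `urllib.request` follows redirects by itself. A minimal sketch, assuming the page is utf-8 (real pages may declare another charset):

```python
import urllib.request

def fetch_html(url):
    # urllib.request follows redirects on its own, so pycurl's
    # FOLLOWLOCATION/MAXREDIRS options have no equivalent here.
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()  # bytes
    # Assumption: the page is utf-8; replace undecodable bytes
    # instead of raising.
    return raw.decode("utf-8", errors="replace")

# html = fetch_html("http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html")
# with open("file.html", "w", encoding="utf-8") as fh:
#     fh.write(html)
```

Like the pycurl version, this still only fetches the raw html, not the linked resources.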
Try emulating your browser with selenium. This script will pop up the save-as
dialog for the webpage. You will still have to figure out how to emulate pressing Enter for the download to start, as the file dialog is out of selenium's reach (how you do it is also OS-dependent).
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
br = webdriver.Firefox()
br.get('http://www.google.com/')
save_me = ActionChains(br).key_down(Keys.CONTROL)\
.key_down('s').key_up(Keys.CONTROL).key_up('s')
save_me.perform()
Also, I think following @Amber's suggestion of grabbing the linked resources may be simpler, and thus a better solution. Still, I think using selenium is a good starting point, as br.page_source
will get you the entire DOM along with the dynamic content generated by javascript.
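For reference, @Amber's idea of grabbing the linked resources can be sketched as follows. The html can come from urlretrieve's output or from br.page_source; this sketch uses the stdlib's html.parser instead of BeautifulSoup only to keep it self-contained, and the example page and URLs are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ResourceCollector(HTMLParser):
    """Collect absolute URLs of images, scripts and stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(urljoin(self.base_url, attrs["href"]))

# Illustrative page fragment; in practice feed it the saved html.
page = ('<img src="/a.png">'
        '<link rel="stylesheet" href="style.css">'
        '<script src="app.js"></script>')
collector = ResourceCollector("http://example.com/dir/")
collector.feed(page)
# Each URL in collector.resources could then be downloaded
# (e.g. with urllib) next to the saved html file.
print(collector.resources)
```

Rewriting the src/href attributes in the saved html to point at the local copies is still left to do, which is why a ready-made tool (or the browser's own save option) may be less work in practice.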