How to save a "complete webpage", not just the basic HTML, using Python

Problem description

I am using the following code to save a webpage using Python:

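import urllib.request  # Python 3; the original used Python 2's urllib.urlretrieve

url = 'http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html'
# Fetches only the HTML document itself, not the resources it links to
urllib.request.urlretrieve(url, 'test.html')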

Problem: This code saves only the basic HTML, without JavaScript, images, etc. I want to save the complete webpage (like the "Save Page As" option in a browser).

Update: I am now using the following code, hoping to save all the JS/image/CSS files of the webpage so that it is saved as a complete webpage, but my output is still just the basic HTML:

import pycurl
from io import BytesIO  # Python 3; the original used Python 2's StringIO

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html")

b = BytesIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
c.close()
html = b.getvalue()  # raw bytes of the HTML document only
# print(html)
with open("file.html", "wb") as fh:
    fh.write(html)

Solution

Try emulating your browser with Selenium. This script will pop up the "Save As" dialog for the webpage. You will still have to figure out how to emulate pressing Enter to start the download, because the file dialog is out of Selenium's reach (how you do it is also OS dependent).

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

# Launch Firefox and load the page to save
br = webdriver.Firefox()
br.get('http://www.google.com/')

# Send Ctrl+S to open the browser's "Save As" dialog
save_me = (ActionChains(br)
           .key_down(Keys.CONTROL)
           .key_down('s')
           .key_up(Keys.CONTROL)
           .key_up('s'))
save_me.perform()

Also, I think following @Amber's suggestion of grabbing the linked resources may be a simpler, and thus better, solution. Still, I think using Selenium is a good starting point, as br.page_source will get you the entire DOM along with the dynamic content generated by JavaScript.
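
To illustrate the "grab the linked resources" idea, here is a minimal sketch using only the Python 3 standard library. It is not code from the original answer: the names `ResourceCollector` and `collect_resources` are hypothetical, and the example.com URLs are placeholders. It only lists the image, script, and stylesheet URLs that a page references. Downloading them and rewriting the references in the saved HTML would still be needed for a truly complete copy.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Hypothetical helper: collects URLs of images, scripts and stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Resolve relative paths against the page URL
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.resources.append(urljoin(self.base_url, attrs["href"]))


def collect_resources(html, base_url):
    """Return absolute URLs of the resources referenced by the HTML string."""
    parser = ResourceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.resources


# Placeholder input; in practice the HTML could come from br.page_source
sample = """<html><head>
<link rel="stylesheet" href="/css/main.css">
<script src="js/app.js"></script>
</head><body><img src="http://cdn.example.com/logo.png"></body></html>"""

for resource_url in collect_resources(sample, "http://example.com/shop/index.html"):
    print(resource_url)
```

Each collected URL could then be fetched, for example with urllib.request.urlretrieve, and stored next to the saved HTML file. Resources referenced from inside CSS or loaded by scripts at runtime are not found by this approach.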
