Inherent way to save web page source
Question
I have read a lot of answers regarding web scraping that talk about BeautifulSoup, Scrapy, etc. for performing web scraping.
Is there a way to do the equivalent of saving a page's source from a web browser?
That is, is there a way in Python to point it at a website and have it save the page's source to a text file, using just the standard Python modules?
This is where I'm going with it:
import urllib
f = open('webpage.txt', 'w')
html = urllib.urlopen("http://www.somewebpage.com")
#somehow save the web page source
f.close()
Not much, I know - but I'm looking for code to actually pull the source of the page so I can write it out. I gather that urlopen just makes a connection.
Perhaps there is a readlines() equivalent for reading the lines of a web page?
Answer
You can try urllib2:
import urllib2
page = urllib2.urlopen('http://stackoverflow.com')
page_content = page.read()
with open('page_content.html', 'w') as fid:
    fid.write(page_content)
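Note that urllib2 only exists on Python 2. On Python 3 the same approach works with urllib.request, the module that urllib2 was merged into; a minimal sketch (the URL and file name below are just placeholders):

```python
import urllib.request

def save_page(url, path):
    # urlopen() returns a response object; read() yields bytes,
    # so write the output file in binary mode to keep the source as-is.
    with urllib.request.urlopen(url) as page:
        content = page.read()
    with open(path, 'wb') as fid:
        fid.write(content)
    return content

if __name__ == '__main__':
    save_page('http://stackoverflow.com', 'page_content.html')
```

The response object also supports readlines(), so iterating over the page line by line works just like it does for an ordinary file.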