How to download a full webpage with a Python script?
Problem description
Currently I have a script that can only download the HTML of a given page.
Now I want to download all the files of the web page, including the HTML, CSS, JS and image files (the same as we get with Ctrl-S on any website).
My current code is:
import urllib
url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.urlretrieve(url, "t3.html")
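(A side note, not part of the original question: this snippet is Python 2, where urlretrieve lives directly on the urllib module. In Python 3 the same function moved to urllib.request, so a minimal equivalent, which still saves only the HTML, would be:)

import urllib.request  # Python 3 home of urlretrieve

url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.request.urlretrieve(url, "t3.html")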
I visited many questions, but they all only download the HTML.
Recommended answer
The following implementation enables you to get the sub-HTML websites. It can be developed further to get the other files you need. I set the depth variable for you, to set the maximum number of levels of sub-websites that you want to parse.
import urllib2
from BeautifulSoup import *
from urlparse import urljoin

def crawl(pages, depth=None):
    indexed_url = []  # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop the fragment part of the URL
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url

pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls
Python 3 version, 2019. May this save somebody some time:
#!/usr/bin/env python
import urllib.request as urllib2
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(pages, depth=None):
    indexed_url = []  # a list for the main and sub-HTML websites in the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print("Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read(), "html.parser")  # name a parser explicitly to avoid the bs4 warning
                links = soup('a')  # finding all the sub_links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop the fragment part of the URL
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url

pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print(urls)
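The crawler above only collects URLs; it does not itself save the CSS, JS and image files the question asks about. As a rough sketch of how it "can be developed further" (my own illustration, not part of the original answer; the save_assets function and the "mirror" directory name are made up), one could parse a page's asset references and download each one:

import os
import urllib.request
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def save_assets(page_url, out_dir="mirror"):
    # Sketch only: fetch one page and its CSS/JS/image files into out_dir.
    os.makedirs(out_dir, exist_ok=True)
    html = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html, "html.parser")
    for tag, attr in [("img", "src"), ("script", "src"), ("link", "href")]:
        for node in soup.find_all(tag):
            if tag == "link" and "stylesheet" not in (node.get("rel") or []):
                continue  # only download <link> tags that are stylesheets
            ref = node.get(attr)
            if not ref:
                continue
            asset_url = urljoin(page_url, ref)  # resolve relative references
            name = os.path.basename(urlparse(asset_url).path) or "asset"
            try:
                urllib.request.urlretrieve(asset_url, os.path.join(out_dir, name))
            except Exception as e:
                print("Could not fetch %s: %s" % (asset_url, e))
    with open(os.path.join(out_dir, "page.html"), "wb") as f:
        f.write(html)  # save the HTML itself alongside its assets

save_assets("https://en.wikipedia.org/wiki/Python_%28programming_language%29")

From there, each URL returned by crawl could be fed through the same routine to approximate a Ctrl-S style save of every page it finds.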