How to download a full webpage with a Python script?


Problem Description

Currently I have a script that can only download the HTML of a given page.

Now I want to download all the files that make up the web page, including the HTML, CSS, JS, and image files (the same as you get with Ctrl-S in a browser).

My current code is:

import urllib
url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.urlretrieve(url, "t3.html")  # Python 2: saves only the raw HTML, none of the assets
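
(For reference, a minimal Python 3 equivalent of this snippet, where urlretrieve moved into urllib.request; it likewise saves only the raw HTML:)

# Python 3: urlretrieve now lives in urllib.request; still saves only the HTML
import urllib.request

url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.request.urlretrieve(url, "t3.html")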

I have looked at many related questions, but they all only download the HTML.

Solution

The following implementation enables you to collect the sub-HTML pages of a website. It can be developed further to fetch the other files you need (a sketch of such an extension follows the Python 3 version below). I added the depth parameter so you can set the maximum number of levels of sub-pages to crawl.

import urllib2
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin


def crawl(pages, depth=None):
    indexed_url = []  # collects the main page and every sub-HTML page found
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except Exception:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # find all the sub-links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop fragment identifiers
                        if url[0:4] == 'http' and url not in indexed_url:
                            indexed_url.append(url)
        pages = indexed_url[:]  # copy, so we never iterate a list we are appending to
    return indexed_url


pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls
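
(Note that the snippet above is Python 2 only: urllib2 and urlparse no longer exist in Python 3, and `from BeautifulSoup import BeautifulSoup` refers to the legacy BeautifulSoup 3 package rather than today's beautifulsoup4/bs4.)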

Python 3 version, 2019. Hopefully this saves somebody some time:

#!/usr/bin/env python3


import urllib.request as urllib2  # keep the Python 2 name so the body stays unchanged
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def crawl(pages, depth=None):
    indexed_url = []  # collects the main page and every sub-HTML page found
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except Exception:
                    print("Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read(), "html.parser")
                links = soup('a')  # find all the sub-links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop fragment identifiers
                        if url[0:4] == 'http' and url not in indexed_url:
                            indexed_url.append(url)
        pages = indexed_url[:]  # copy, so we never iterate a list we are appending to
    return indexed_url


pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print(urls)
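
Both versions above only collect page links; they do not yet fetch the CSS, JS, or image files the question asks about. As a hedged sketch of what the "developed further" step might look like, assuming Python 3 with beautifulsoup4 installed (save_assets and the assets/ directory are illustrative names of mine, not part of the original answer):

# A sketch, not part of the original answer: fetch one page and save its
# linked CSS, JS, and image files, roughly like a browser's Ctrl-S.
import os
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def save_assets(page_url, out_dir="assets"):
    """Save the CSS, JS and image files referenced by one page into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    html = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html, "html.parser")
    # <link href=...> covers stylesheets, <script src=...> scripts, <img src=...> images
    for tag in soup.find_all(["link", "script", "img"]):
        src = tag.get("href") or tag.get("src")
        if not src:
            continue
        asset_url = urljoin(page_url, src)
        if not asset_url.startswith("http"):
            continue  # skip data: URIs and other non-HTTP references
        name = os.path.basename(asset_url.split("?")[0]) or "index"
        try:
            urllib.request.urlretrieve(asset_url, os.path.join(out_dir, name))
        except Exception as exc:
            print("Could not fetch %s (%s)" % (asset_url, exc))


save_assets("https://en.wikipedia.org/wiki/Python_%28programming_language%29")

This does not rewrite the links inside the saved HTML, so a local copy would still point at the live assets; a complete Ctrl-S replacement would also need that rewriting step.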

