Download .xls files from a webpage using Python and BeautifulSoup

Question

I want to download all the .xls, .xlsx, and .csv files from this website into a specified folder.

https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009

I have looked into mechanize, Beautiful Soup, urllib2, etc. Mechanize does not work in Python 3, and urllib2 also had problems with Python 3; I looked for a workaround but couldn't find one. So I am currently trying to make it work using Beautiful Soup.

I found some example code and attempted to modify it to suit my problem, as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')

However, when run, this code neither extracts the files from the target page nor outputs any failure message (e.g. 'failed to download').

  • How can I select the Excel files from the page using BeautifulSoup?
  • How can I download those files to a local folder using Python?

Answer

The issues with your script as it stands are:

  1. The url has a trailing /, which gives an invalid page when requested, so the files you want to download are never listed.
  2. The CSS selector in soup.select(...) is selecting div elements with the attribute webpartid, which does not exist anywhere in the linked document.
  3. You are joining the URL and quoting it, even though the links are given in the page as absolute URLs and they do not need quoting.
  4. The try:...except: block is stopping you from seeing the errors generated when trying to download the file. Using an except block without a specific exception is bad practice and should be avoided; see the sketch after this list.
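
For example, catching the specific urllib errors and printing them makes failures visible instead of silent (a minimal sketch, assuming href and filename come from your existing loop):

from urllib.error import HTTPError, URLError

try:
    urlretrieve(href, filename)
except (HTTPError, URLError) as e:
    # Report the actual error rather than swallowing it
    print('failed to download %s: %s' % (href, e))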

A modified version of your code that will get the correct files and attempt to download them is as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')

    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")

However, if you run this you'll notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file is downloadable in the browser. At first I thought this was a referrer check (to prevent hotlinking), but if you watch the request in your browser (e.g. in the Chrome Developer Tools) you'll notice that the initial http:// request is blocked there too, and Chrome then attempts an https:// request for the same file.
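
You can confirm this from Python as well; a minimal sketch that probes one of the extracted links over both schemes (assuming href is one of the http:// URLs collected by the loop above):

from urllib.request import urlopen
from urllib.error import HTTPError

for candidate in (href, href.replace('http://', 'https://')):
    try:
        urlopen(candidate).close()
        print(candidate, '-> OK')
    except HTTPError as e:
        # The plain http:// variant is the one that returns 403 here
        print(candidate, '->', e.code, e.reason)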

In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this you will need to rewrite http: to https: before using the URL for the request. The following code will correctly modify the URLs and download the files. I've also added a variable to specify the output folder, which is joined to the filename using os.path.join:

import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])

    # We need a https:// URL for this site
    href = href.replace('http://','https://')

    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")
