Download .xls files from a webpage using Python and BeautifulSoup

Problem Description

I want to download all the .xls or .xlsx or .csv from this website into a specified folder.

https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009

I have looked into mechanize, Beautiful Soup, urllib2, etc. Mechanize does not work in Python 3, and urllib2 also had problems with Python 3; I looked for a workaround but couldn't find one. So I am currently trying to make it work using Beautiful Soup.

I found some example code and attempted to modify it to suit my problem, as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')

However, when run, this code does not extract any files from the target page, nor does it output any failure message (e.g. 'failed to download').

  • How can I select the Excel files from the page using BeautifulSoup?
  • How can I download these files to a local file using Python?

Recommended Answer

The issues with your script as it stands are:

  1. The url has a trailing / which gives an invalid page when requested, so the files you want to download are never listed.
  2. The CSS selector in soup.select(...) is selecting div elements with the attribute webpartid, which does not exist anywhere in that linked document, so the loop body never runs and no message is printed (see the short check after this list).
  3. You are joining the URL and quoting it, even though the links are given in the page as absolute URLs and do not need quoting.
  4. The try:...except: block is stopping you from seeing the errors generated when trying to download the file. Using an except block without a specific exception is bad practice and should be avoided.
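
A quick way to confirm points 2 and 3 for yourself is a minimal sketch along these lines (assuming the page is reachable; it only counts matches and downloads nothing):

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
with urlopen(url) as u:
    soup = BeautifulSoup(u.read().decode('utf-8'), 'html.parser')

# The original selector matches nothing, which is why the loop never runs
# and no 'failed to download' message ever appears.
print(len(soup.select('div[webpartid] a')))    # expected: 0

# The page does contain ordinary <a> links, already given as absolute URLs.
print(len(soup.select('a[href^="http://"]')))  # expected: > 0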

A modified version of your code that will get the correct files and attempt to download them is as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')

    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")

However, if you run this you'll notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file is downloadable in the browser. At first I thought this was a referrer check (to prevent hotlinking), however if you watch the request in your browser (e.g. in the Chrome Developer Tools) you'll notice that the initial http:// request is blocked there too, and that Chrome then attempts an https:// request for the same file.
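
To see the same behaviour from Python, here is a small sketch (not part of the final script) that takes the first spreadsheet link from the page and requests it over both schemes; the expectations in the comments follow from the behaviour described above:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError

page = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
with urlopen(page) as u:
    soup = BeautifulSoup(u.read().decode('utf-8'), 'html.parser')

# First spreadsheet link on the page, exactly as given (plain http://)
href = next(a['href'] for a in soup.select('a[href^="http://"]')
            if a['href'].endswith(('.csv', '.xls', '.xlsx')))

try:
    urlopen(href)                 # plain http:// request
except HTTPError as e:
    print(e)                      # expected: HTTP Error 403: Forbidden

# The same URL over https:// is accepted
print(urlopen(href.replace('http://', 'https://')).status)   # expected: 200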

In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this you will need to rewrite http: to https: before using the URL for the request. The following code will correctly modify the URLs and download the files. I've also added a variable to specify the output folder, which is joined to the filename using os.path.join:

import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])

    # We need a https:// URL for this site
    href = href.replace('http://','https://')

    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")
