HTTP Error 403: Forbidden with urlretrieve

Question

I am trying to download a PDF, but I get the following error: HTTP Error 403: Forbidden

I am aware that the server is blocking the request for some reason, but I can't seem to find a solution.

import urllib.request
import urllib.parse
import requests


def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)


try:
    url = ('http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf')

    print('initialized')

    hdr = {}
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }

    print('HDR recieved')

    req = urllib.request.Request(url, headers=hdr)

    print('Header sent')

    resp = urllib.request.urlopen(req)

    print('Request sent')

    respData = resp.read()

    download_pdf(url)

    print('Complete')

except Exception as e:
    print(str(e))

Recommended answer

You seem to have already realised this: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. urllib.request.urlretrieve() does not let you change the HTTP headers; however, you can use urllib.request.URLopener.retrieve():

import urllib.request

opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')

N.B. You are using Python 3, where these functions are considered part of the "Legacy interface" and URLopener has been deprecated. For that reason you should not use them in new code.
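For anyone who wants to stay within urllib without the deprecated URLopener, here is a minimal sketch of two possible approaches. It is not part of the original answer. The function names and the USER_AGENT value are hypothetical, and it assumes the server only inspects the User-Agent header. The first function sends the header through a Request object; the second installs a global opener so that urlretrieve() itself sends it. Note that urlretrieve() is also documented under the legacy interface, so the first variant is probably the safer choice.

```python
import shutil
import urllib.request

# Hypothetical value: any User-Agent string the server accepts would do.
USER_AGENT = "Mozilla/5.0"


def download_with_request(url, filename, user_agent=USER_AGENT):
    """Fetch url with a custom User-Agent and stream the body to filename."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Copy the response to disk in chunks rather than reading it all at once.
    with urllib.request.urlopen(req) as resp, open(filename, "wb") as outfile:
        shutil.copyfileobj(resp, outfile)
    return filename


def download_with_urlretrieve(url, filename, user_agent=USER_AGENT):
    """Make urlretrieve() send a custom User-Agent via a global opener."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    # install_opener() is global: it affects every later urlopen() and
    # urlretrieve() call in this process.
    urllib.request.install_opener(opener)
    saved_path, _headers = urllib.request.urlretrieve(url, filename)
    return saved_path
```

Either function could be called as download_with_request(url, 'Test.pdf'), with url set to the PDF address from the question. If the server still answers 403, urlopen() raises urllib.error.HTTPError, which suggests it is checking something other than the User-Agent.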

The above aside, you are going to a lot of trouble simply to access a URL. Your code imports requests but never uses it. You should use it, because it is much easier than urllib. This works for me:

import requests

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
# Raise on 4xx/5xx so an error page is not saved as a PDF.
r.raise_for_status()
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
