HTTP Error 403: Forbidden with urlretrieve
Question
I am trying to download a PDF, but I get the following error: HTTP Error 403: Forbidden
I am aware that the server is blocking the request for some reason, but I can't seem to find a solution.
import urllib.request
import urllib.parse
import requests

def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)

try:
    url = ('http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf')
    print('initialized')
    hdr = {}
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }
    print('HDR recieved')
    req = urllib.request.Request(url, headers=hdr)
    print('Header sent')
    resp = urllib.request.urlopen(req)
    print('Request sent')
    respData = resp.read()
    download_pdf(url)
    print('Complete')
except Exception as e:
    print(str(e))
Recommended answer
You seem to have already realised this: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. urllib.request.urlretrieve() doesn't allow you to change the HTTP headers; however, you can use urllib.request.URLopener.retrieve():
import urllib.request
opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')
N.B. You are using Python 3, where these functions are considered part of the "Legacy interface", and URLopener has been deprecated. For that reason you should not use them in new code.
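For new code, the same effect can be had without the legacy interface by building a Request that carries the header and passing it to urlopen(). The following is a minimal sketch, not the only way to do it: the function name download_with_user_agent and the default 'Mozilla/5.0' value are illustrative choices, and whether a given User-Agent string satisfies the server is an assumption.

```python
import shutil
import urllib.request


def download_with_user_agent(url, dest, user_agent="Mozilla/5.0"):
    """Fetch url with a custom User-Agent header and save the body to dest."""
    # Request accepts arbitrary headers, which urlretrieve() does not.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # urlopen() raises urllib.error.HTTPError for a 403, so a refusal is not silent.
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        # Copy the response to disk in chunks rather than reading it all at once.
        shutil.copyfileobj(resp, out)
    return dest


if __name__ == "__main__":
    pdf_url = ("http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/"
               "Mathematics%20(0580)/0580_s03_qp_1.pdf")
    download_with_user_agent(pdf_url, "Test.pdf")
```

Note that in the question's code the Request with headers is built correctly, but download_pdf() then calls urlretrieve() on the bare URL, which sends the default urllib User-Agent again.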
The above aside, you are going to a lot of trouble simply to access a URL. Your code imports requests but never uses it. You should use it, because it is much easier than urllib. This works for me:
import requests
url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
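The snippet above writes r.content without checking the status code, so if the server did answer 403 the error page would be saved as the PDF. A slightly more defensive sketch follows; the function name download_with_requests, the explicit User-Agent value, and the 30-second timeout are illustrative assumptions rather than anything the server is known to require. The plain requests.get() call presumably succeeds because requests sends its own python-requests User-Agent, which this server does not appear to reject.

```python
import requests


def download_with_requests(url, dest, user_agent="Mozilla/5.0", timeout=30):
    """Download url to dest, failing loudly on HTTP errors such as 403."""
    headers = {"User-Agent": user_agent}
    # stream=True defers the body download so it can be written in chunks.
    with requests.get(url, headers=headers, stream=True, timeout=timeout) as r:
        # Raises requests.exceptions.HTTPError for 4xx/5xx responses.
        r.raise_for_status()
        with open(dest, "wb") as outfile:
            for chunk in r.iter_content(chunk_size=8192):
                outfile.write(chunk)
    return dest


if __name__ == "__main__":
    pdf_url = ("http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/"
               "Mathematics%20(0580)/0580_s03_qp_1.pdf")
    download_with_requests(pdf_url, "0580_s03_qp_1.pdf")
```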