Python follow redirects and then download the page?
Problem Description
I have the following Python script, and it works beautifully.
import urllib2
url = 'http://abc.com' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
However, some of the URLs I give it may redirect 2 or more times. How can I have Python wait for the redirects to complete before reading the data? For instance, when using the above code with
http://www.google.com/search?hl=en&q=KEYWORD&btnI=1
which is the equivalent of hitting the "I'm Feeling Lucky" button on a Google search, I get:
>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usock = urllib2.urlopen(url)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>>
I've tried passing the (url, data, timeout) arguments; however, I am unsure what to put there.
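As a side note, the 403 here is usually not caused by the redirects themselves: Google rejects requests carrying urllib's default User-Agent. A minimal sketch of sending a browser-like User-Agent instead, written for Python 3's urllib.request (urllib2's successor; the urllib2 API is the same apart from the module name):

```python
import urllib.request  # "urllib2" in Python 2

def fetch(url):
    """Fetch a URL with a browser-like User-Agent header.

    Google (and many other sites) return 403 Forbidden for the
    default "Python-urllib/x.y" User-Agent, so we override it.
    """
    req = urllib.request.Request(
        url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Hypothetical usage (requires network access):
# data = fetch('http://www.google.com/search?hl=en&q=KEYWORD&btnI=1')
```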
EDIT: I actually found out that if I don't follow the redirect and just read the headers of the first response, I can grab the location of the next redirect and use that as my final link.
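That approach can be sketched as follows. This is one way to do it in Python 3's urllib.request (assumption: a custom HTTPRedirectHandler whose redirect_request returns None, which makes urllib surface the 3xx response as an HTTPError instead of following it, so the Location header can be read directly):

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None tells urllib not to follow the redirect; the
    # 3xx response then propagates as an HTTPError we can inspect.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect())

def first_redirect_target(url):
    """Return the Location header of the first response, or None
    if the URL does not redirect."""
    try:
        opener.open(url)
        return None  # 2xx response: no redirect happened
    except urllib.error.HTTPError as e:
        return e.headers.get('Location')

# Hypothetical usage (requires network access):
# print(first_redirect_target('http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'))
```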
You might be better off with the Requests library, which has a better API for controlling redirect handling:
https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history
Requests:
https://pypi.org/project/requests/ (urllib replacement for humans)
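A minimal sketch of what the answer suggests, assuming the third-party requests package is installed (pip install requests). Requests follows redirects automatically; response.url is the final destination and response.history holds the intermediate 3xx responses:

```python
import requests  # third-party: pip install requests

# Hypothetical usage (requires network access):
# r = requests.get(
#     'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1',
#     headers={'User-Agent': 'Mozilla/5.0'})
# r.url       -> final URL after all redirects were followed
# r.history   -> list of the intermediate redirect responses
# r.text      -> body of the final page
#
# Passing allow_redirects=False instead returns the first response
# untouched, so the next hop is in r.headers['Location'].
```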