Python follow redirects and then download the page?

Problem description

I have the following Python script, and it works beautifully.

import urllib2

url = 'http://abc.com' # write the url here

usock = urllib2.urlopen(url)
data = usock.read()
usock.close()

print data

However, some of the URLs I give it may redirect two or more times. How can I have Python wait for the redirects to complete before loading the data? For instance, when using the above code with

http://www.google.com/search?hl=en&q=KEYWORD&btnI=1

which is the equivalent of hitting the "I'm Feeling Lucky" button on a Google search, I get:

>>> url = 'http://www.google.com/search?hl=en&q=KEYWORD&btnI=1'
>>> usick = urllib2.urlopen(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
>>> 

I've tried the (url, data, timeout) arguments; however, I am unsure what to put there.

EDIT: I actually found out that if I don't follow the redirect and just use the header from the first link, I can grab the location of the next redirect and use that as my final link.
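
The approach described in the edit could be sketched roughly as follows. This is a minimal Python 3 sketch using only the standard library (`http.client`), not the asker's actual code. The helper names `next_redirect_location` and `resolve_redirects`, and the `max_hops` limit, are hypothetical and chosen only for illustration.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def next_redirect_location(url, timeout=10.0):
    """Send one GET without following redirects.

    Returns the absolute redirect target, or None if the response
    is not a redirect. (Hypothetical helper for illustration.)
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        location = response.getheader("Location")
        if response.status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the request URL.
            return urljoin(url, location)
        return None
    finally:
        conn.close()


def resolve_redirects(url, max_hops=10, timeout=10.0):
    """Follow Location headers one hop at a time and return the final URL."""
    for _ in range(max_hops):
        target = next_redirect_location(url, timeout=timeout)
        if target is None:
            return url
        url = target
    raise RuntimeError("too many redirects")
```

Calling `resolve_redirects(url)` would return the last URL in the chain, which can then be downloaded with a normal request. The hop limit is there so that a redirect loop cannot run forever.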

Solution

You might be better off with the Requests library, which has a better API for controlling redirect handling:

https://requests.readthedocs.io/en/master/user/quickstart/#redirection-and-history

Requests:

https://pypi.org/project/requests/ (urllib replacement for humans)
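
As a rough illustration of the redirect handling mentioned above: Requests follows redirects for GET requests by default, and the intermediate responses are kept in `response.history`. The sketch below assumes Python 3 with Requests installed. The function name `download_after_redirects` and its optional `session` parameter are hypothetical, added here only so the logic can be exercised with a stand-in session. Whether a particular server still answers with 403 depends on that server, not on the redirect handling.

```python
def download_after_redirects(url, session=None, timeout=10.0):
    """Fetch url, following redirects, and report the hops taken.

    Returns (final_url, redirect_urls, body_text).
    (Hypothetical helper for illustration.)
    """
    owns_session = session is None
    if owns_session:
        # Third-party dependency (pip install requests); imported here so
        # callers that pass their own session object do not need it.
        import requests
        session = requests.Session()
    try:
        # allow_redirects=True is already the default for GET; it is
        # spelled out here to make the behaviour explicit.
        response = session.get(url, allow_redirects=True, timeout=timeout)
        response.raise_for_status()
        # Each entry in history is an intermediate redirect response.
        redirect_urls = [r.url for r in response.history]
        return response.url, redirect_urls, response.text
    finally:
        if owns_session:
            session.close()
```

A call such as `final_url, hops, body = download_after_redirects(url)` would give the page body in `body` and the list of redirecting URLs in `hops`. Passing `allow_redirects=False` to `session.get` instead would return the first response unfollowed, with the target in its `Location` header.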
