Urllib Unicode Error, no unicode involved
Problem description
EDIT: I've majorly edited the content of this post since the original to specify my problem:
I am writing a program to download webcomics, and I'm getting this weird error when downloading a page of the comic. The code I am running essentially boils down to the following line followed by the error. I do not know what is causing this error, and it is confusing me greatly.
>>> urllib.request.urlopen("http://abominable.cc/post/47699281401")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 470, in open
response = meth(req, response)
File "/usr/lib/python3.4/urllib/request.py", line 580, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.4/urllib/request.py", line 502, in error
result = self._call_chain(*args)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 685, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python3.4/urllib/request.py", line 464, in open
response = self._open(req, data)
File "/usr/lib/python3.4/urllib/request.py", line 482, in _open
'_open', req)
File "/usr/lib/python3.4/urllib/request.py", line 442, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 1211, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.4/urllib/request.py", line 1183, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.4/http/client.py", line 1172, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python3.4/http/client.py", line 1014, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 37-38: ordinal not in range(128)
The entirety of my program can be found here: https://github.com/nstephenh/pycomic
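The bottom of the traceback shows where things actually blow up: `http.client` encodes the whole request line as ASCII before sending it. A minimal sketch of that same failure, using a hypothetical path containing a non-ASCII character:

```python
# The request line http.client builds must be pure ASCII. Any non-ASCII
# character in the path (here a hypothetical 'é') triggers the same
# UnicodeEncodeError seen at the bottom of the traceback above.
line = "GET /posts/héllo HTTP/1.1\r\n"
try:
    line.encode("ascii")
except UnicodeEncodeError as e:
    print(e.reason)  # ordinal not in range(128)
```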
I was having the same problem. The root cause is that the remote server isn't playing by the rules. HTTP headers are supposed to be US-ASCII only, but apparently the leading HTTP web servers (apache2, nginx) don't care and send raw UTF-8 encoded strings.
However, in http.client the header parsing decodes the headers as iso-8859-1, and the default HTTPRedirectHandler in urllib doesn't bother to quote the Location or URI header, resulting in the aforementioned error.
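The round trip can be seen in isolation. A server sends UTF-8 bytes in the Location header; http.client decodes them as iso-8859-1, producing mojibake; re-encoding as iso-8859-1 recovers the original bytes, which `quote` then turns into a pure-ASCII percent-encoded path. A small sketch of that repair (the 'é' is just an illustrative character):

```python
from urllib.parse import quote

# Server sends UTF-8 bytes for 'é' in the Location header:
raw = "é".encode("utf-8")                 # b'\xc3\xa9'
# http.client decodes headers as iso-8859-1, yielding mojibake:
mis_decoded = raw.decode("iso-8859-1")    # 'Ã©'
# Re-encoding as iso-8859-1 recovers the original bytes, and quote()
# percent-encodes them into ASCII safe for the request line:
fixed = quote(mis_decoded.encode("iso-8859-1"))
print(fixed)  # %C3%A9
```

This is exactly the trick the three added lines in the handler below rely on.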
I was able to 'work around' both things by overriding the default HTTPRedirectHandler and adding three lines to counter the latin-1 decoding and quote the path:
import urllib.request
from urllib.error import HTTPError
from urllib.parse import (
urlparse, quote, urljoin, urlunparse)
class UniRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Implementation note: To avoid the server sending us into an
    # infinite loop, the request object needs to track what URLs we
    # have already seen.  Do this by adding a handler-specific
    # attribute to the Request object.
    def http_error_302(self, req, fp, code, msg, headers):
        # Some servers (incorrectly) return multiple Location headers
        # (so probably same goes for URI).  Use first header.
        if "location" in headers:
            newurl = headers["location"]
        elif "uri" in headers:
            newurl = headers["uri"]
        else:
            return

        # fix a possible malformed URL
        urlparts = urlparse(newurl)

        # For security reasons we don't allow redirection to anything
        # other than http, https or ftp.
        if urlparts.scheme not in ('http', 'https', 'ftp', ''):
            raise HTTPError(
                newurl, code,
                "%s - Redirection to url '%s' is not allowed" % (msg, newurl),
                headers, fp)

        if not urlparts.path:
            urlparts = list(urlparts)
            urlparts[2] = "/"
        else:
            urlparts = list(urlparts)

        # Headers should only contain US-ASCII chars, but some servers do
        # send unicode data that should be quoted back before being reused.
        # Re-encode the string as iso-8859-1 before quote() to cancel the
        # effect of the header parsing in http/client.py
        urlparts[2] = quote(urlparts[2].encode('iso-8859-1'))
        newurl = urlunparse(urlparts)

        newurl = urljoin(req.full_url, newurl)

        # XXX Probably want to forget about the state of the current
        # request, although that might interact poorly with other
        # handlers that also use handler-specific request attributes
        new = self.redirect_request(req, fp, code, msg, headers, newurl)
        if new is None:
            return

        # loop detection
        # .redirect_dict has a key url if url was previously visited.
        if hasattr(req, 'redirect_dict'):
            visited = new.redirect_dict = req.redirect_dict
            if (visited.get(newurl, 0) >= self.max_repeats or
                    len(visited) >= self.max_redirections):
                raise HTTPError(req.full_url, code,
                                self.inf_msg + msg, headers, fp)
        else:
            visited = new.redirect_dict = req.redirect_dict = {}
        visited[newurl] = visited.get(newurl, 0) + 1

        # Don't close the fp until we are sure that we won't use it
        # with HTTPError.
        fp.read()
        fp.close()

        return self.parent.open(new, timeout=req.timeout)

    http_error_301 = http_error_303 = http_error_307 = http_error_302
[...]
# Replace the default redirect handler in urllib; this should be done once
# at the beginning of the program
opener = urllib.request.build_opener(UniRedirectHandler())
urllib.request.install_opener(opener)
This is Python 3 code but should be easy to adapt to Python 2 if need be.
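As a quick sanity check that the custom handler is actually wired in: `build_opener` drops a default handler whenever you pass in a subclass of it, so the opener should contain `UniRedirectHandler` and not the stock `HTTPRedirectHandler`. A minimal sketch (using an empty subclass as a stand-in for the full class above):

```python
import urllib.request

# Stand-in for the full UniRedirectHandler defined above, just to
# demonstrate the wiring:
class UniRedirectHandler(urllib.request.HTTPRedirectHandler):
    pass

opener = urllib.request.build_opener(UniRedirectHandler())
urllib.request.install_opener(opener)

# build_opener replaces the default HTTPRedirectHandler with our subclass:
print(any(type(h) is UniRedirectHandler for h in opener.handlers))  # True
```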