Python urllib.request and utf8 decoding question


Question

I'm writing a simple Python CGI script that grabs a webpage and displays the HTML file in the web browser (acting like a proxy). Here is the script:

#!/usr/bin/env python3.0

import urllib.request

site = "http://reddit.com/"
site = urllib.request.urlopen(site)
site = site.read()
site = site.decode('utf8')

print("Content-type: text/html\n\n")
print(site)

This script works fine when run from the command line, but when I view it through a web browser, it shows a blank page. Here is the error I get in Apache's error_log:

Traceback (most recent call last):
  File "/home/public/projects/proxy/script.cgi", line 11, in <module>
    print(site)
  File "/usr/local/lib/python3.0/io.py", line 1491, in write
    b = encoder.encode(s)
  File "/usr/local/lib/python3.0/encodings/ascii.py", line 22, in encode
    return codecs.ascii_encode(input, self.errors)[0]
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 33777: ordinal not in range(128)

Solution

When you print it at the command line, you print a Unicode string to the terminal. The terminal has an encoding, so Python will encode your Unicode string to that encoding. This will work fine.

When you use it in CGI, you end up printing to stdout, which is not attached to a terminal and so has no encoding to inherit. Python therefore tries to encode the string with ASCII. This fails, as ASCII doesn't contain all the characters you are trying to print (here '\u2019', a right single quotation mark), so you get the above error.
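The failure can be reproduced without Apache. This is a minimal sketch, assuming an in-memory byte stream stands in for the CGI stdout; the helper name `write_text` and the sample string are made up for illustration.

```python
import io

def write_text(text, encoding):
    """Write text through a text stream with the given encoding; return the raw bytes."""
    raw = io.BytesIO()  # stands in for the byte stream behind stdout
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

sample = "It\u2019s a page"  # contains U+2019, the character named in the traceback

utf8_bytes = write_text(sample, "utf-8")  # encodes without error

try:
    write_text(sample, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True  # same exception type as in the Apache error_log
```

The only difference between the two calls is the encoding attached to the stream, which mirrors the terminal case versus the CGI case.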

The fix for this is to encode your string into some sort of encoding (why not UTF8?) and also say so in the header.

So something like this:

import sys

sys.stdout.buffer.write(b"Content-Type: text/html; charset=UTF-8\n\n")  # the parameter is spelled charset
sys.stdout.buffer.write(site.encode('UTF8'))
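One way to keep the header and the body in agreement is to build both from the same charset value. The following is only a sketch: `build_response` is a hypothetical helper, not part of the original script or of any library.

```python
def build_response(html, charset="utf-8"):
    """Return the CGI response as bytes: header line, blank line, encoded body."""
    header = "Content-Type: text/html; charset={}\n\n".format(charset)
    # The header itself is plain ASCII; the body uses the charset it declares.
    return header.encode("ascii") + html.encode(charset)
```

In the CGI script this could be used as `sys.stdout.buffer.write(build_response(site))`, so that everything leaves the process as bytes and Python never has to guess an encoding for stdout.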

Under Python 2, this would work as well:

print("Content-Type: text/html; charset=UTF-8\n\n")  # the parameter is spelled charset
print(site.encode('UTF8'))

But under Python 3 the encoded data is a bytes object, so print() would output its repr (b'...') instead of the raw bytes.
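A small sketch of what Python 3's print() does with encoded data, assuming an in-memory text stream in place of stdout so the result can be inspected:

```python
import io

body = "It\u2019s".encode("utf-8")  # bytes, as returned by str.encode()

captured = io.StringIO()
print(body, file=captured)  # print() converts the bytes object with str()
printed = captured.getvalue()
```

Here `printed` holds the textual form of the bytes object, starting with `b'` and containing escape sequences such as `\xe2\x80\x99`, which is why the bytes have to go through `sys.stdout.buffer.write()` instead.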

Of course you'll notice that you now first decode from UTF8 and then re-encode it. You don't need to do that, strictly speaking. But if you want to modify the HTML in between, it may actually be a good idea to do so, and keep all modifications in Unicode.
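As a sketch of the "modify in between" case, the function below decodes the fetched bytes, edits the HTML as text, and encodes the result again. The name `rewrite_page` and the inserted notice are hypothetical, chosen only to show the round trip.

```python
def rewrite_page(raw, charset="utf-8"):
    """Decode fetched bytes, edit the HTML as text, and encode it again."""
    html = raw.decode(charset)
    # Hypothetical edit: add a notice right after the opening body tag.
    html = html.replace("<body>", "<body><p>Proxied copy</p>", 1)
    return html.encode(charset)
```

If no edit is needed, the bytes returned by `site.read()` could be passed to `sys.stdout.buffer.write()` unchanged, skipping both the decode and the encode.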
