Python urllib.request and utf8 decoding question

Question

I'm writing a simple Python CGI script that grabs a webpage and displays the HTML file in the web browser (acting like a proxy). Here is the script:

#!/usr/bin/env python3.0

import urllib.request

site = "http://reddit.com/"
site = urllib.request.urlopen(site)
site = site.read()
site = site.decode('utf8')

print("Content-type: text/html\n\n")
print(site)

This script works fine when run from the command line, but when viewed in a web browser it shows a blank page. Here is the error I get in Apache's error_log:

Traceback (most recent call last):
  File "/home/public/projects/proxy/script.cgi", line 11, in <module>
    print(site)
  File "/usr/local/lib/python3.0/io.py", line 1491, in write
    b = encoder.encode(s)
  File "/usr/local/lib/python3.0/encodings/ascii.py", line 22, in encode
    return codecs.ascii_encode(input, self.errors)[0]
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 33777: ordinal not in range(128)

Recommended answer

When you print it at the command line, you print a Unicode string to the terminal. The terminal has an encoding, so Python will encode your Unicode string to that encoding. This will work fine.

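When you use it in CGI, you end up printing to a stdout that is not a terminal and, typically, has no locale configured by the web server, so it has no usable encoding. Python therefore falls back to encoding the string as ASCII. This fails, because ASCII cannot represent every character you are trying to print (here '\u2019', a right single quotation mark), so you get the above error.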
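
The failure can be reproduced without a web server. The following is a minimal sketch, assuming a made-up sample string that contains the same '\u2019' character named in the traceback; the variable names are illustrative only:

```python
# Hypothetical sample text containing U+2019 (right single quotation mark),
# the character reported in the traceback.
text = "It\u2019s a test"

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    # exc.object is the string being encoded; start/end mark the bad span.
    failed_char = exc.object[exc.start:exc.end]

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = text.encode("utf-8")
```

Encoding to ASCII raises the same UnicodeEncodeError that Apache logged, while encoding to UTF-8 produces plain bytes that can be written out safely.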

The fix for this is to encode your string into some sort of encoding (why not UTF8?) and also say so in the header.

Like this:

import sys

sys.stdout.buffer.write(b"Content-Type: text/html; charset=UTF-8\n\n")  # The Content-Type parameter is "charset".
sys.stdout.buffer.write(site.encode('UTF8'))
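
The same idea can be wrapped in a small helper so it can be tried outside a CGI environment. This is a sketch under assumptions: the function name `write_cgi_response` and the sample HTML are hypothetical, the header uses `charset`, which is the standard Content-Type parameter name, and an in-memory `io.BytesIO` stands in for `sys.stdout.buffer`:

```python
import io


def write_cgi_response(html, out):
    """Write a CGI header and the UTF-8 encoded page to a binary stream."""
    out.write(b"Content-Type: text/html; charset=UTF-8\n\n")
    out.write(html.encode("utf-8"))


# In a real CGI script, pass sys.stdout.buffer instead of a BytesIO.
buf = io.BytesIO()
write_cgi_response("<p>It\u2019s fine</p>", buf)
response = buf.getvalue()
```

Because only bytes are written, Python never has to guess an encoding for stdout.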

Under Python 2, this would work as well:

print("Content-Type: text/html; charset=UTF-8\n\n")  # The Content-Type parameter is "charset".
print(site.encode('UTF8'))

But under Python 3 the encoded data is a bytes object, so print() would output its repr (b'...') instead of the raw bytes.
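
A short sketch of what Python 3 does when bytes are passed to print(). The sample string is made up, and the output is captured in a StringIO so it can be inspected:

```python
import io
from contextlib import redirect_stdout

# Hypothetical sample: "café" encoded as UTF-8 bytes.
encoded = "caf\u00e9".encode("utf-8")

captured = io.StringIO()
with redirect_stdout(captured):
    print(encoded)  # print() calls str() on the bytes object

printed = captured.getvalue()
# printed now holds the repr text b'caf\xc3\xa9' plus a newline,
# not the four characters of "café".
```

This is why the Python 3 version writes to sys.stdout.buffer, which accepts bytes directly.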

Of course you'll notice that you now first decode from UTF8 and then re-encode it. You don't need to do that, strictly speaking. But if you want to modify the HTML in between, it may actually be a good idea to do so, and keep all modifications in Unicode.
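
As an illustration of modifying the page between decoding and re-encoding, here is a minimal sketch. The helper `rewrite_page` and the sample markup are hypothetical, and a plain string replacement stands in for whatever modification is actually needed:

```python
def rewrite_page(raw_bytes, old, new):
    """Decode UTF-8 bytes, replace text as Unicode, and re-encode as UTF-8."""
    text = raw_bytes.decode("utf-8")   # bytes -> str
    text = text.replace(old, new)      # all edits happen on Unicode text
    return text.encode("utf-8")        # str -> bytes for output


# Hypothetical page content standing in for the downloaded HTML.
raw = "<title>Caf\u00e9</title>".encode("utf-8")
result = rewrite_page(raw, "Caf\u00e9", "Th\u00e9")
```

Working on the decoded string avoids matching or splitting in the middle of a multi-byte UTF-8 sequence.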
