UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function


Problem description

I am writing a Python (Python 3.3) program that sends some data to a webpage using the POST method. Mostly for debugging, I fetch the page result and display it on the screen using the print() function.

The code looks like this:

conn.request("POST", resource, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
print(data.decode('utf-8'))

The HTTPResponse .read() method returns a bytes object encoding the page (which is a well-formed UTF-8 document). It seemed okay until I stopped using the IDLE GUI for Windows and used the Windows console instead. The returned page has a U+2014 character (em-dash), which the print function handles well in the Windows GUI (I presume Code Page 1252) but not in the Windows console (Code Page 850). Given the default 'strict' error handling, I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 10248: character maps to <undefined>

I could fix it using this quite ugly code:

print(data.decode('utf-8').encode('cp850','replace').decode('cp850'))

Now it replaces the offending character "—" with a ?. Not the ideal case (a hyphen would be a better replacement) but good enough for my purpose.

There are several things I do not like about my solution.

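  1. The code is ugly with all that decoding, encoding, and decoding.
  2. It solves the problem only for this case. If I port the program to a system using some other encoding (latin-1, cp437, back to cp1252, etc.), it should recognize the target encoding. It does not. (For instance, when using the IDLE GUI again, the em-dash is also lost, which did not happen before.)
  3. It would be nicer if the em-dash were translated to a hyphen instead of a question mark.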

The problem is not the em-dash (I can think of several ways to solve that particular problem); I need to write robust code. I am feeding the page with data from a database, and that data can come back. I can anticipate many other conflicting cases: an 'Á' U+00C1 (which is possible in my database) can be encoded in CP-850 (the DOS/Windows console encoding for Western European languages) but not in CP-437 (the encoding for US English, which is the default in many Windows installations).

So, the question:

Is there a nicer solution that makes my code agnostic of the output interface encoding?

Recommended answer

I see three solutions to this:

  1. Change the output encoding so that it always outputs UTF-8. See e.g. "Setting the correct encoding when piping stdout in Python", but I could not get these examples to work.

  2. The following example code (Python 2 syntax) makes the output aware of your target charset.

# -*- coding: utf-8 -*-
import sys

print sys.stdout.encoding
print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
print u"Стоескер".encode(sys.stdout.encoding, errors='replace')

This example properly replaces any character in my name that the target charset cannot represent with a question mark.

If you create a custom print function, e.g. called myprint, that uses this mechanism to encode output properly, you can simply replace print with myprint wherever necessary without making the whole code look ugly.
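As a rough Python 3 sketch of that idea: the function below round-trips the text through the stream's reported encoding with a lenient error handler before writing it. The name myprint comes from the text above; its parameters and the fallback to ASCII when a stream reports no encoding are assumptions made for illustration, not part of the original answer.

```python
import sys


def myprint(*args, sep=" ", end="\n", file=None, errors="replace"):
    """Hypothetical print() stand-in that avoids UnicodeEncodeError."""
    stream = sys.stdout if file is None else file
    # Assumption: fall back to ASCII if the stream reports no encoding.
    encoding = getattr(stream, "encoding", None) or "ascii"
    text = sep.join(str(arg) for arg in args) + end
    # Round-trip through the target encoding so that characters it cannot
    # represent are substituted instead of raising an exception.
    safe = text.encode(encoding, errors=errors).decode(encoding)
    stream.write(safe)
```

Calling myprint(data.decode('utf-8')) in place of print(...) keeps the encode/decode round trip in one place and picks up the target encoding from the stream instead of hard-coding cp850.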

  3. Reset the output encoding globally at the beginning of the program:

The page http://www.macfreek.nl/memory/Encoding_of_Python_stdout has a good summary of how to change the output encoding. The section "StreamWriter Wrapper around Stdout" is especially interesting. Essentially it says to change the I/O encoding function like this:

In Python 2:

import codecs
import sys

if sys.stdout.encoding != 'cp850':
  sys.stdout = codecs.getwriter('cp850')(sys.stdout, 'strict')
if sys.stderr.encoding != 'cp850':
  sys.stderr = codecs.getwriter('cp850')(sys.stderr, 'strict')

In Python 3:

import codecs
import sys

if sys.stdout.encoding != 'cp850':
  sys.stdout = codecs.getwriter('cp850')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding != 'cp850':
  sys.stderr = codecs.getwriter('cp850')(sys.stderr.buffer, 'strict')
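On Python 3.7 and later (newer than the 3.3 used in the question), a similar global effect can probably be had without replacing the stream objects, because io.TextIOWrapper provides a reconfigure() method. The sketch below keeps each stream's own encoding and only changes the error handler; the helper name make_lenient is hypothetical, and streams that are not TextIOWrapper instances (IDLE's, for example) are left untouched.

```python
import io
import sys


def make_lenient(stream, errors="replace"):
    """Switch a text stream's error handler in place, keeping its encoding."""
    # reconfigure() is available on io.TextIOWrapper since Python 3.7;
    # other stream objects may not provide it, so they are skipped.
    if isinstance(stream, io.TextIOWrapper):
        stream.reconfigure(errors=errors)
    return stream


make_lenient(sys.stdout)
make_lenient(sys.stderr)
```

After these calls, a plain print() of text containing U+2014 on a cp850 console should emit a ? in its place rather than raise.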

If used in a CGI script that outputs HTML, you can replace 'strict' with 'xmlcharrefreplace' to get numeric HTML character references (such as &#8212;) for characters the encoding cannot represent.
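To make the difference between the error handlers concrete, here is a small Python 3 sketch that encodes a string containing the em-dash from the question to cp850 with several of the standard handlers. The sample text and the results dictionary are illustrative only.

```python
text = "dash \u2014 here"  # U+2014 (em-dash) has no mapping in cp850

handlers = ("replace", "xmlcharrefreplace", "backslashreplace", "ignore")
# Map each handler name to the bytes it produces for the sample text.
results = {name: text.encode("cp850", errors=name) for name in handlers}

for name, encoded in results.items():
    print(name, encoded)
# replace            -> b'dash ? here'
# xmlcharrefreplace  -> b'dash &#8212; here'
# backslashreplace   -> b'dash \\u2014 here'
# ignore             -> b'dash  here'
```

With 'strict', the same call raises the UnicodeEncodeError shown in the question.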

Feel free to modify these approaches, for example by setting different encodings. Note that this still does not work for byte strings whose encoding was never specified, so all data, input, and text must first be correctly converted to unicode:

# -*- coding: utf-8 -*-
import sys
import codecs
sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout, 'xmlcharrefreplace')
print u"Stöcker"                # works
print "Stöcker".decode("utf-8") # works
print "Stöcker"                 # fails
