UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function


Problem description


I am writing a Python (Python 3.3) program that sends data to a web page using the POST method. Mostly for debugging, I retrieve the page result and display it on the screen with the print() function.

The code is like this:

conn.request("POST", resource, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
print(data.decode('utf-8'))

The HTTPResponse.read() method returns a bytes object encoding the page (which is a well-formed UTF-8 document). It seemed fine until I stopped using the IDLE GUI for Windows and used the Windows console instead. The returned page contains a U+2014 character (em dash), which the print function handles well in the Windows GUI (I presume code page 1252) but not in the Windows console (code page 850). Given the default strict error handling, I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 10248: character maps to <undefined>
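
As a minimal sketch of where this error comes from (the sample string and the variable names are made up for illustration and are not part of the original program), encoding text that contains U+2014 to cp850 with the default strict handler raises the same exception, and the exception object records what went wrong and where:

```python
# Hypothetical sample text; the em dash sits at index 5.
text = 'dash \u2014 here'

failure = None
try:
    text.encode('cp850')  # errors='strict' is the default
except UnicodeEncodeError as exc:
    failure = exc

if failure is not None:
    # The attributes below are standard on UnicodeEncodeError.
    bad_chars = failure.object[failure.start:failure.end]
    print(failure.encoding, failure.start, failure.reason)
```

print() goes through the same encoding step when it writes to a console whose code page lacks the character, which is why the traceback mentions the 'charmap' codec rather than print itself.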

I could fix it using this quite ugly code:

print(data.decode('utf-8').encode('cp850','replace').decode('cp850'))

Now it replaces the offending character "—" with a "?". Not ideal (a hyphen would be a better replacement), but good enough for my purpose.

There are several things I do not like about my solution.

  1. The code is ugly with all that decoding, encoding, and decoding.
  2. It solves the problem only for this case. If I port the program to a system that uses some other encoding (latin-1, cp437, back to cp1252, etc.), it should recognize the target encoding. It does not. (For instance, when I go back to the IDLE GUI, the em dash is also lost, which did not happen before.)
  3. It would be nicer if the em dash were translated to a hyphen instead of a question mark.
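
One possible way to address the third point is a custom codec error handler. This is only a sketch under stated assumptions: the handler name 'ascii_fallback' and the FALLBACKS table are hypothetical and chosen for illustration, while codecs.register_error is the standard-library hook for registering such handlers.

```python
import codecs

# Hypothetical substitution table: typographic characters and ASCII stand-ins.
FALLBACKS = {
    '\u2014': '-',   # em dash
    '\u2013': '-',   # en dash
    '\u2018': "'",   # left single quotation mark
    '\u2019': "'",   # right single quotation mark
    '\u201c': '"',   # left double quotation mark
    '\u201d': '"',   # right double quotation mark
}


def ascii_fallback(error):
    """Replace unencodable characters with an ASCII stand-in, or '?'."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = ''.join(FALLBACKS.get(ch, '?') for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, error.end


codecs.register_error('ascii_fallback', ascii_fallback)

text = 'before \u2014 after \u4e2d'
encoded = text.encode('cp850', errors='ascii_fallback')
```

With this handler the em dash becomes a hyphen, and characters missing from the table still degrade to a question mark instead of raising.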

The problem is not the em dash itself (I can think of several ways to solve that particular problem); I need to write robust code. I am feeding the page with data from a database, and that data can come back. I can anticipate many other conflicting cases: an 'Á' U+00C1 (which is possible in my database) can be encoded in CP-850 (the DOS/Windows console encoding for Western European languages) but not in CP-437 (the encoding for US English, which is the default in many Windows installations).
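
The code page differences described above can be checked directly. The following is a small sketch; the helper name can_encode and the sample table are assumptions made for illustration.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


CODE_PAGES = ('cp1252', 'cp850', 'cp437', 'utf-8')
SAMPLES = {'EM DASH': '\u2014', 'A WITH ACUTE': '\u00c1'}

# Map each sample character to the code pages that can represent it.
support = {
    name: {page: can_encode(char, page) for page in CODE_PAGES}
    for name, char in SAMPLES.items()
}

for name, pages in support.items():
    print(name, pages)
```

The em dash is available in cp1252 but in neither console code page, while 'Á' is available in cp850 but not in cp437, which matches the cases in the question.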

So, the question:

Is there a nicer solution that makes my code agnostic of the output interface encoding?

Solution

I see three solutions to this:

  1. Change the output encoding so that it always outputs UTF-8. See e.g. Setting the correct encoding when piping stdout in Python, but I could not get these examples to work.

  2. The following example code (Python 2 syntax) makes the output aware of your target charset.

    # -*- coding: utf-8 -*-
    import sys
    
    print sys.stdout.encoding
    print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
    print u"Стоескер".encode(sys.stdout.encoding, errors='replace')
    

    This example replaces any character in my name that the target encoding cannot represent with a question mark.

    If you create a custom print function, e.g. called myprint, that uses this mechanism to encode output properly, you can simply replace print with myprint wherever necessary without making the whole code look ugly.

  3. Reset the output encoding globally at the start of the program:

    The page http://www.macfreek.nl/memory/Encoding_of_Python_stdout has a good summary of how to change the output encoding. The section "StreamWriter Wrapper around Stdout" is especially interesting. Essentially, it says to change the I/O encoding function like this:

    In Python 2:

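    import codecs
    import sys
    # Wrap stdout/stderr in a cp850 StreamWriter unless they already use cp850.
    if sys.stdout.encoding != 'cp850':
        sys.stdout = codecs.getwriter('cp850')(sys.stdout, 'strict')
    if sys.stderr.encoding != 'cp850':
        sys.stderr = codecs.getwriter('cp850')(sys.stderr, 'strict')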
    

    In Python 3:

    import codecs
    import sys
    # Python 3 streams are text-based, so wrap the underlying binary buffer.
    if sys.stdout.encoding != 'cp850':
        sys.stdout = codecs.getwriter('cp850')(sys.stdout.buffer, 'strict')
    if sys.stderr.encoding != 'cp850':
        sys.stderr = codecs.getwriter('cp850')(sys.stderr.buffer, 'strict')
    

    If this is used in a CGI script that outputs HTML, you can replace 'strict' with 'xmlcharrefreplace' to get numeric character references (such as &#8212; for an em dash) for characters the encoding cannot represent.

    Feel free to modify these approaches, e.g. by setting different encodings. Note that they still will not work for data whose encoding is not specified. Any data, input, or text must be correctly convertible to unicode:

    # -*- coding: utf-8 -*-
    import sys
    import codecs
    sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout, 'xmlcharrefreplace')
    print u"Stöcker"                # works: unicode literal
    print "Stöcker".decode("utf-8") # works: byte string decoded explicitly
    print "Stöcker"                 # fails: byte string is implicitly decoded as ASCII
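
Since the question targets Python 3 while the snippets above use Python 2 syntax, here is a hedged Python 3 sketch of the custom print function idea from the second solution. The name safe_print and the in-memory stand-in for a cp850 console are assumptions made for illustration; the helper looks up the encoding of the target stream instead of hard-coding a code page.

```python
import io
import sys


def safe_print(*args, sep=' ', end='\n', file=None, errors='replace'):
    """Like print(), but substitutes characters the target stream cannot encode."""
    stream = sys.stdout if file is None else file
    # Fall back to ASCII if the stream does not report an encoding.
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    text = sep.join(str(arg) for arg in args) + end
    # Round-trip through the stream's own encoding so that unencodable
    # characters are replaced before the stream's strict encoder sees them.
    stream.write(text.encode(encoding, errors=errors).decode(encoding))


# Demonstration with an in-memory stream standing in for a cp850 console.
fake_console = io.TextIOWrapper(io.BytesIO(), encoding='cp850', newline='')
safe_print('em dash: \u2014', 'A acute: \u00c1', file=fake_console)
fake_console.flush()
console_bytes = fake_console.buffer.getvalue()
```

Here the em dash is replaced with a question mark while 'Á' passes through unchanged, because cp850 can represent it. On Python 3.7 and later, sys.stdout.reconfigure(errors='replace') gives a similar effect without a helper, but that method is not available in the Python 3.3 used in the question.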
    
