Which is the correct way to encode escape characters in Python 2 without killing Unicode?


Question

I think I'm going crazy with Python's unicode strings. I'm trying to encode escape characters in a Unicode string without escaping actual Unicode characters. I'm getting this:

In [14]: a = u"Example\n"

In [15]: b = u"Пример\n"

In [16]: print a
Example


In [17]: print b
Пример


In [18]: print a.encode('unicode_escape')
Example\n

In [19]: print b.encode('unicode_escape')
\u041f\u0440\u0438\u043c\u0435\u0440\n

What I desperately need is this (the English example already works as I want, obviously):

In [18]: print a.encode('unicode_escape')
Example\n

In [19]: print b.encode('unicode_escape')
Пример\n

What should I do, short of moving to Python 3?

PS: As pointed out below, I'm actually seeking to escape control characters. Whether I need more than just those remains to be seen.

Recommended answer

Backslash-escaping ASCII control characters in the middle of Unicode data is definitely a useful thing to try to accomplish. But it's not just a matter of escaping them; you also have to unescape them properly when you want the actual character data back.

There should be a way to do this in the Python stdlib, but there is not. I filed a bug report: http://bugs.python.org/issue18679

In the meantime, here's a workaround using translate and some hackery:

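# Escape table: control characters 0-31 map to their backslash escapes.
# The u'%s' wrapper keeps the values unicode on Python 2, where
# unicode.translate() does not accept plain str values.
tm = dict((k, u'%s' % repr(chr(k))[1:-1]) for k in range(32))
tm[0] = u'\\0'
tm[7] = u'\\a'
tm[8] = u'\\b'
tm[11] = u'\\v'
tm[12] = u'\\f'
tm[ord('\\')] = u'\\\\'  # escape the backslash itself so unescaping is unambiguous

b = u"Пример\n"
c = b.translate(tm)
print(c) ## results in: Пример\n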

All control characters that have no single-letter backslash escape are escaped as \x## sequences, but if you need something different done with those, your translation table can do that. This approach is not lossy, though, so it works for me.
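As an illustration of "something different", here is a minimal Python 3 sketch of a table that writes control characters as \uXXXX instead of \x##. The name custom_tm and the choice to keep short forms only for tab, newline and carriage return are assumptions made for this example, not part of the original answer.

```python
# Hypothetical variant table: every C0 control character becomes \uXXXX,
# except tab, newline and carriage return, which keep their short escapes.
custom_tm = {k: '\\u%04x' % k for k in range(32)}
custom_tm.update({9: '\\t', 10: '\\n', 13: '\\r'})
custom_tm[ord('\\')] = '\\\\'  # a literal backslash still has to be doubled

sample = u"Пример\x1b\n"
escaped = sample.translate(custom_tm)
print(escaped)  # Пример\u001b\n
```

Because \uXXXX is also understood by the unicode_escape codec, text escaped this way can be read back with the same decoding step as the default table.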

But getting it back out is hacky too, because you can't use translate to turn character sequences back into single characters.

d = c.encode('latin1', 'backslashreplace').decode('unicode_escape')
print(d) ## result in Пример with trailing newline character

You actually have to encode the characters that map to single bytes using latin1, while backslash-escaping the Unicode characters that latin1 doesn't know about, so that the unicode_escape codec can reassemble everything the right way.
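The two steps can be looked at separately. The following is a small sketch for Python 3; the variable names are only illustrative.

```python
escaped = u"Пример\\n"  # escaped text: ends with a literal backslash followed by 'n'

# Step 1: Latin-1 cannot represent Cyrillic, so the backslashreplace handler
# rewrites each such character as an ASCII \uXXXX escape. The backslash and
# 'n' that are already in the text pass through unchanged.
as_bytes = escaped.encode('latin1', 'backslashreplace')
print(as_bytes)  # b'\\u041f\\u0440\\u0438\\u043c\\u0435\\u0440\\n'

# Step 2: unicode_escape interprets every escape sequence in a single pass,
# turning \uXXXX back into Cyrillic and \n back into a real newline.
restored = as_bytes.decode('unicode_escape')
print(repr(restored))
```

After step 1 everything is plain ASCII escapes or Latin-1 bytes, which is exactly the input the unicode_escape decoder expects.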

Update

So I had a case where I needed this to work in both Python 2.7 and Python 3.3. Here's what I did (buried in a _compat.py module):

# Helper naming: u = unicode text, n = native str, b = bytes.
if isinstance(b"", str):
    # Python 2: the native str type is a byte string.
    byte_types = (str, bytes, bytearray)
    text_types = (unicode, )
    def uton(x): return x.encode('utf-8', 'surrogateescape')
    def ntob(x): return x
    def ntou(x): return x.decode('utf-8', 'surrogateescape')
    def bton(x): return x
else:
    # Python 3: the native str type is unicode text.
    byte_types = (bytes, bytearray)
    text_types = (str, )
    def uton(x): return x
    def ntob(x): return x.encode('utf-8', 'surrogateescape')
    def ntou(x): return x
    def bton(x): return x.decode('utf-8', 'surrogateescape')

# Control characters 0-31 map to their backslash escapes (unicode values).
escape_tm = dict((k, ntou(repr(chr(k))[1:-1])) for k in range(32))
# The values below must be a literal backslash plus a letter;
# u'\0', u'\a' etc. would map each control character to itself.
escape_tm[0] = u'\\0'
escape_tm[7] = u'\\a'
escape_tm[8] = u'\\b'
escape_tm[11] = u'\\v'
escape_tm[12] = u'\\f'
escape_tm[ord('\\')] = u'\\\\'

def escape_control(s):
    if isinstance(s, text_types):
        return s.translate(escape_tm)
    else:
        # Bytes: decode to text, escape, then encode back to bytes.
        return s.decode('utf-8', 'surrogateescape').translate(escape_tm).encode('utf-8', 'surrogateescape')

def unescape_control(s):
    if isinstance(s, text_types):
        return s.encode('latin1', 'backslashreplace').decode('unicode_escape')
    else:
        return s.decode('utf-8', 'surrogateescape').encode('latin1', 'backslashreplace').decode('unicode_escape').encode('utf-8', 'surrogateescape')
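A quick way to convince yourself that the approach round-trips is to run a few samples through both directions. This is a compact, text-only sketch for Python 3; make_escape_table, escape_text and unescape_text are hypothetical names standing in for the functions above, and the sample strings are made up.

```python
def make_escape_table():
    # Same idea as escape_tm above, written for Python 3 only.
    table = {k: repr(chr(k))[1:-1] for k in range(32)}
    table.update({0: '\\0', 7: '\\a', 8: '\\b', 11: '\\v', 12: '\\f'})
    table[ord('\\')] = '\\\\'
    return table

ESCAPE_TM = make_escape_table()

def escape_text(s):
    return s.translate(ESCAPE_TM)

def unescape_text(s):
    return s.encode('latin1', 'backslashreplace').decode('unicode_escape')

samples = [u"Пример\n", u"tab\there", u"back\\slash", u"caf\xe9 \x1b[0m", u"bell\a"]
for text in samples:
    # Each sample should survive escape followed by unescape unchanged.
    assert unescape_text(escape_text(text)) == text

print(escape_text(u"Пример\n"))  # Пример\n

# Caveat: NUL written as \0 and followed by a digit reads back as an
# octal escape, so this particular input does not round-trip.
tricky = u"\x001"
print(repr(unescape_text(escape_text(tricky))))  # '\x01'
```

If NUL followed by digits can occur in your data, one option is to drop the \0 override so that NUL is written as \x00, which always consumes exactly two hex digits.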
