Print UTF-8 characters in Python 2.7

Question

Here is how I open, read, and print the file. It is a UTF-8 encoded file containing Unicode characters. I want to print the first 10 characters, but the code snippet below prints 10 weird, unrecognized characters instead. Does anyone have any ideas on how to print them correctly? Thanks.

with open(name, 'r') as content_file:
    content = content_file.read()
    for i in range(10):
        print content[i]

Each of the 10 weird characters looks like this: �

Regards, Lin

Recommended answer

When Unicode code points (characters) are encoded as UTF-8, some code points are converted to a single byte, but many become more than one byte. Characters in the standard 7-bit ASCII range are encoded as single bytes, while all other characters require two to four bytes each.
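
As a quick check of those byte counts, the following Python 3 sketch encodes a few sample characters and reports how many bytes each one needs. The sample characters and the name `widths` are chosen here for illustration and are not from the original answer.

```python
# Python 3 sketch: count the UTF-8 bytes needed for a few sample characters.
samples = ["A", "\u00a9", "\u2122", "\U0001F600"]  # A, copyright, trademark, an emoji

widths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    widths[ch] = len(encoded)
    # Print the code point and the raw bytes (ASCII-only output).
    print("U+%04X -> %r (%d byte(s))" % (ord(ch), encoded, len(encoded)))
```

The ASCII letter takes one byte, while the other three samples take two, three, and four bytes.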

So you are getting those weird characters because you are breaking up those multi-byte UTF-8 sequences into single bytes. Sometimes those bytes will correspond to normal printable characters, but often they won't, so you get � printed instead.
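
The effect can be reproduced in Python 3 by decoding each byte on its own. This is a sketch, not part of the original answer. It uses `errors="replace"`, which substitutes U+FFFD (the � replacement character) for bytes that do not form valid UTF-8, which is roughly what a UTF-8 terminal does when it receives a lone byte. The names `pieces` and `whole` are illustrative.

```python
# Python 3 sketch: a two-byte character decoded whole versus byte by byte.
data = "\u00a9".encode("utf-8")  # b'\xc2\xa9', the copyright sign

# Decoding each byte separately: neither byte is valid UTF-8 on its own.
pieces = [data[i:i + 1].decode("utf-8", errors="replace")
          for i in range(len(data))]

# Decoding the full sequence recovers the original character.
whole = data.decode("utf-8")

print(["U+%04X" % ord(p) for p in pieces])
print("U+%04X" % ord(whole))
```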

Here's a short demo using the ©, ®, and ™ characters, which are encoded as 2, 2, and 3 bytes respectively in UTF-8. My terminal is set to use UTF-8.

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
print utfbytes, len(utfbytes)
for b in utfbytes:
    print b, repr(b)

uni = utfbytes.decode('utf-8')
print uni, len(uni)

Output

© ® ™ 9
� '\xc2'
� '\xa9'
  ' '
� '\xc2'
� '\xae'
  ' '
� '\xe2'
� '\x84'
� '\xa2'
© ® ™ 5

Stack Overflow co-founder Joel Spolsky has written a good article on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

You should also take a look at the Unicode HOWTO article in the Python docs, and Ned Batchelder's Pragmatic Unicode article, aka "Unipain".

Here's a short example of extracting individual characters from a UTF-8 encoded byte string. As I mention in the comments, to do this correctly you need to know how many bytes each character is encoded as.

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    print "%d %d [%s]" % (start, w, utfbytes[start:start+w])
    start += w

Output

0 2 [©]
2 1 [ ]
3 2 [®]
5 1 [ ]
6 3 [™]

FWIW, here's a Python 3 version of that code:

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    s = utfbytes[start:start+w]
    print("%d %d [%s]" % (start, w, s.decode()))
    start += w

If we don't know the byte widths of the characters in our UTF-8 string, then we need to do a little more work. Each UTF-8 sequence encodes the width of the sequence in its first byte, as described in the Wikipedia article on UTF-8.
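
To see those width markers, the sketch below (Python 3, not part of the original answer) prints each byte of the sample string in binary. Start bytes begin with 0, 110, 1110, or 11110, while continuation bytes begin with 10. The names `patterns` and `kind` are illustrative.

```python
# Python 3 sketch: show the leading bits that mark each byte's role.
utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"

# In Python 3, iterating over a bytes object yields integers.
patterns = [format(b, "08b") for b in utfbytes]

for b, bits in zip(utfbytes, patterns):
    kind = "continuation" if bits.startswith("10") else "start"
    print("0x%02x %s %s" % (b, bits, kind))
```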

The following Python 2 demo shows how you can extract that width information; it produces the same output as the two previous snippets.

# UTF-8 code widths
# width  starting byte
# 1      0xxxxxxx
# 2      110xxxxx
# 3      1110xxxx
# 4      11110xxx
# C      10xxxxxx  (continuation byte)

def get_width(b):
    if b <= '\x7f':
        return 1
    elif '\x80' <= b <= '\xbf':
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif '\xc0' <= b <= '\xdf':
        return 2
    elif '\xe0' <= b <= '\xef':
        return 3
    elif '\xf0' <= b <= '\xf7':
        return 4
    else:
        # '\xf8' and above never start a UTF-8 sequence
        raise ValueError('%r is not a valid UTF-8 start byte' % b)


utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
start = 0
while start < len(utfbytes):
    b = utfbytes[start]
    w = get_width(b)
    s = utfbytes[start:start+w]
    print "%d %d [%s]" % (start, w, s)
    start += w

Generally, it should not be necessary to do this sort of thing: just use the provided decoding methods.
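
For the original problem, that means decoding the file as it is read, so that indexing yields characters rather than bytes. Here is a minimal sketch, assuming the file really is UTF-8. `io.open` accepts an `encoding` argument in both Python 2.7 and Python 3 (in Python 3 it is the same function as the built-in `open`). The temporary file and its contents are made up here so that the example is self-contained.

```python
import io
import os
import tempfile

# Create a small UTF-8 file to stand in for the real input (hypothetical data).
fd, name = tempfile.mkstemp()
os.close(fd)
with io.open(name, "w", encoding="utf-8") as f:
    f.write(u"\u00a9 \u00ae \u2122 and some more text")

# Decode while reading: content is a Unicode string, not a byte string.
with io.open(name, "r", encoding="utf-8") as content_file:
    content = content_file.read()

# Slicing now works on characters, so multi-byte sequences stay intact.
first_ten = content[:10]
for ch in first_ten:
    print("U+%04X" % ord(ch))

os.remove(name)
```

The loop prints code point numbers to keep the output ASCII-only. On a terminal configured for UTF-8, `print(ch)` should display the characters themselves.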

For the curious, here's a Python 3 version of get_width, and a function that decodes a UTF-8 bytestring manually.

def get_width(b):
    if b <= 0x7f:
        return 1
    elif 0x80 <= b <= 0xbf:
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif 0xc0 <= b <= 0xdf:
        return 2
    elif 0xe0 <= b <= 0xef:
        return 3
    elif 0xf0 <= b <= 0xf7:
        return 4
    else:
        # 0xf8 and above never start a UTF-8 sequence
        raise ValueError('%r is not a valid UTF-8 start byte' % b)

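def decode_utf8(utfbytes):
    start = 0
    uni = []
    while start < len(utfbytes):
        b = utfbytes[start]
        w = get_width(b)
        if start + w > len(utfbytes):
            raise ValueError('Truncated sequence at offset %d' % start)
        if w == 1:
            # ASCII: the byte value is the code point
            n = b
        else:
            # Keep the payload bits of the start byte (5, 4, or 3 bits)
            n = b & (0x7f >> w)
            for c in utfbytes[start+1:start+w]:
                if not 0x80 <= c <= 0xbf:
                    raise ValueError('Not a continuation byte: %r' % c)
                # Each continuation byte contributes 6 more bits
                n <<= 6
                n |= c & 0x3f
        uni.append(chr(n))
        start += w
    return ''.join(uni)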


utfbytes = b'\xc2\xa9 \xc2\xae \xe2\x84\xa2'
print(utfbytes.decode('utf8'))
print(decode_utf8(utfbytes))

Output

© ® ™
© ® ™
