Why is the en-dash written as '\xe2\x80\x93' in Python?


Question

Specifically, what does each escape in \xe2\x80\x93 do and why does it need 3 escapes? Trying to decode one by itself leads to an 'unexpected end of data' error.

>>> print(b'\xe2\x80\x93'.decode('utf-8'))
–
>>> print(b'\xe2'.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data

Recommended answer

You have UTF-8 bytes. UTF-8 is a codec, a standard for representing text as computer-readable data. The U+2013 EN DASH code point encodes to those 3 bytes in that codec.
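
As a quick check of that round trip, here is a minimal sketch. The variable names are illustrative only and do not come from the question.

```python
# U+2013 EN DASH, written as a Unicode escape so the source stays pure ASCII.
en_dash = "\u2013"

# Encoding the single code point to UTF-8 yields three bytes.
encoded = en_dash.encode("utf-8")
print(encoded)        # b'\xe2\x80\x93'
print(len(en_dash))   # 1 (one code point)
print(len(encoded))   # 3 (three bytes)

# Decoding all three bytes together restores the original character.
print(encoded.decode("utf-8") == en_dash)  # True
```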

Trying to decode just one such byte as UTF-8 doesn't work because, in the UTF-8 standard, that one byte does not carry meaning on its own. In the UTF-8 encoding scheme, a \xe2 byte is the first byte of every code point between U+2000 and U+2FFF in the Unicode standard (each of which is encoded with 2 additional bytes); that's 4096 code points in all.
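
The range can be confirmed with a short sketch like the one below. It simply encodes every code point in the range, assigned or not, and collects the first byte; the names are illustrative.

```python
# Collect the first UTF-8 byte of every code point from U+2000 to U+2FFF.
code_points = range(0x2000, 0x3000)
lead_bytes = {chr(cp).encode("utf-8")[0] for cp in code_points}

print(lead_bytes == {0xE2})   # True: every one starts with \xe2
print(len(code_points))       # 4096

# The neighbours just outside the range use different lead bytes.
below = chr(0x1FFF).encode("utf-8")
above = chr(0x3000).encode("utf-8")
print(below)  # b'\xe1\xbf\xbf'
print(above)  # b'\xe3\x80\x80'
```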

Python represents values in a bytes object in a manner that lets you reproduce the value by copying it back into a Python script or terminal. Anything that isn't printable ASCII is then represented by a \xhh hex escape. The two characters form the hexadecimal value of the byte, an integer number between 0 and 255.

Hexadecimal is a very helpful way to represent bytes because a byte is 2 groups of 4 bits, and each group can be written as a single character, a digit in the range 0 - F.
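
To make the relationship between the bytes repr, the integer values and the hex digits concrete, here is a small sketch. The sample value `data` is made up for illustration: it wraps the three en-dash bytes in two printable ASCII letters.

```python
# Printable ASCII bytes are shown as-is; everything else becomes a \xhh escape.
data = b"A\xe2\x80\x93Z"
print(data)         # b'A\xe2\x80\x93Z'

# Iterating over a bytes object yields integers between 0 and 255.
values = list(data)
print(values)       # [65, 226, 128, 147, 90]

# The same bytes as a plain string of hex digits.
hex_string = data.hex()
print(hex_string)   # 41e280935a

# Each hex digit stands for 4 bits: split 0xE2 into its two halves.
byte = 0xE2
high, low = byte >> 4, byte & 0x0F
print(f"{high:X}{low:X}")        # E2
print(f"{high:04b} {low:04b}")   # 1110 0010
```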

\xe2\x80\x93 then means there are three bytes, with the hexadecimal values E2, 80 and 93, or 226, 128 and 147 in decimal, respectively. The UTF-8 standard tells a decoder to take the last 4 bits of the first byte, and the last 6 bits of each of the second and third bytes (the remaining bits signal what type of byte you are dealing with, which is what makes error handling possible). Those 4 + 6 + 6 == 16 bits then encode the hex value 2013 (0010 000000 010011 in binary).
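
The bit extraction described above can be done by hand. This is a minimal sketch for the 3-byte case only, not a general UTF-8 decoder; the variable names are illustrative.

```python
# Unpacking a bytes object gives the three integer byte values.
b0, b1, b2 = b"\xe2\x80\x93"

# The marker bits: 1110 starts a 3-byte sequence, 10 marks a continuation byte.
assert b0 >> 4 == 0b1110
assert b1 >> 6 == 0b10 and b2 >> 6 == 0b10

# Keep the last 4 bits of the first byte and the last 6 bits of the other two,
# then join them into one 16-bit number.
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

print(hex(code_point))        # 0x2013
print(f"{code_point:016b}")   # 0010000000010011
print(chr(code_point) == b"\xe2\x80\x93".decode("utf-8"))  # True
```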

You probably want to read up about the difference between codecs (encodings) and Unicode; UTF-8 is a codec that can handle all of the Unicode standard, but is not the same thing. See:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • Pragmatic Unicode by Ned Batchelder

  • The Python Unicode HOWTO
