Why is the en-dash written as '\xe2\x80\x93' in Python?
Question
Specifically, what does each escape in \xe2\x80\x93 do, and why does it need 3 escapes? Trying to decode one by itself leads to an 'unexpected end of data' error.
>>> print(b'\xe2\x80\x93'.decode('utf-8'))
–
>>> print(b'\xe2'.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data
Recommended answer
You have UTF-8 bytes. UTF-8 is a codec, a standard for representing text as computer-readable data. The U+2013 EN DASH codepoint encodes to those 3 bytes in that codec.
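As a quick check of that claim, here is a minimal sketch that goes in the other direction, from the codepoint to the bytes. The variable names are only illustrative.

```python
import unicodedata

en_dash = "\u2013"  # the U+2013 codepoint as a one-character str

encoded = en_dash.encode("utf-8")
print(unicodedata.name(en_dash))           # official Unicode name of the codepoint
print(len(encoded), encoded)               # number of bytes and their repr
print(encoded.decode("utf-8") == en_dash)  # decoding restores the original str
```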
Trying to decode just one such byte as UTF-8 doesn't work because, in the UTF-8 standard, that byte does not carry meaning on its own. In the UTF-8 encoding scheme, a \xe2 byte is the first byte for all codepoints between U+2000 and U+2FFF in the Unicode standard (each of which is encoded with 2 additional bytes); that's 4096 codepoints in all.
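The range can be checked by brute force. The sketch below, with a hypothetical helper named lead_byte, encodes every codepoint Python can encode and collects those whose UTF-8 form starts with 0xE2; it takes a moment to run because it walks the whole Unicode range.

```python
def lead_byte(codepoint):
    """Return the first byte of the UTF-8 encoding of a codepoint."""
    return chr(codepoint).encode("utf-8")[0]

# Surrogates (U+D800..U+DFFF) cannot be encoded as UTF-8, so skip them.
matches = [
    cp
    for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF and lead_byte(cp) == 0xE2
]

print(hex(matches[0]), hex(matches[-1]))  # first and last codepoint found
print(len(matches))                       # how many codepoints share the lead byte
```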
Python represents values in a bytes object in a manner that lets you reproduce the value by copying it back into a Python script or terminal. Anything that isn't printable ASCII is represented by a \xhh hex escape. The two characters hh form the hexadecimal value of the byte, an integer between 0 and 255.
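A small sketch of that round trip follows. It uses ast.literal_eval to stand in for pasting the representation back into a script; the sample values are only illustrative.

```python
import ast

data = b"\xe2\x80\x93"

text = repr(data)                      # the form Python shows at the prompt
print(text)
print(ast.literal_eval(text) == data)  # the repr parses back to the same bytes
print(list(data))                      # each byte is an integer between 0 and 255

mixed = b"a-z " + data
print(mixed)                           # printable ASCII stays readable, the rest is escaped
```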
Hexadecimal is a very helpful way to represent bytes because a byte consists of 2 groups of 4 bits, and each group can be represented with one character, a digit in the range 0 - F.
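To illustrate, this sketch splits the byte E2 into its two 4-bit halves and shows that each half maps to exactly one hex digit. The names high and low are just labels chosen here.

```python
byte = 0xE2

high = byte >> 4    # upper 4 bits
low = byte & 0x0F   # lower 4 bits

print(format(byte, "08b"))                       # all 8 bits of the byte
print(format(high, "04b"), format(low, "04b"))   # the two 4-bit groups
print(format(high, "X") + format(low, "X"))      # one hex digit per group
```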
\xe2\x80\x93 then means there are three bytes, with the hexadecimal values E2, 80 and 93, or 226, 128 and 147 in decimal, respectively. The UTF-8 standard tells a decoder to take the last 4 bits of the first byte, and the last 6 bits of each of the second and third bytes (the remaining bits signal what type of byte you are dealing with, which is used for error handling). Those 4 + 6 + 6 == 16 bits then encode the hex value 2013 (0010 000000 010011 in binary).
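The bit extraction described above can be sketched by hand. The function decode_three_byte_utf8 below is a hypothetical helper written for this illustration, not how Python's own codec is implemented, and it skips checks a real decoder performs (for example rejecting overlong encodings and surrogates).

```python
def decode_three_byte_utf8(seq):
    """Decode one 3-byte UTF-8 sequence into an integer codepoint."""
    if len(seq) != 3:
        raise ValueError("expected exactly 3 bytes")
    first, second, third = seq
    # A 3-byte lead byte looks like 1110xxxx; continuation bytes look like 10xxxxxx.
    if (first & 0xF0) != 0xE0 or (second & 0xC0) != 0x80 or (third & 0xC0) != 0x80:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # Keep the last 4 bits of the first byte and the last 6 bits of the other two.
    return ((first & 0x0F) << 12) | ((second & 0x3F) << 6) | (third & 0x3F)

codepoint = decode_three_byte_utf8(b"\xe2\x80\x93")
print(format(codepoint, "016b"))  # the 16 payload bits
print(hex(codepoint))             # the codepoint in hexadecimal
```

Passing the result to chr() gives the same character that b'\xe2\x80\x93'.decode('utf-8') produces.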
You probably want to read up about the difference between codecs (encodings) and Unicode; UTF-8 is a codec that can handle all of the Unicode standard, but is not the same thing. See:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky