Why is the en-dash written as '\xe2\x80\x93' in Python?

Views: 1234 Published: 2020/7/13 4:09:02 Tags: python unicode encoding utf-8

Question

Specifically, what does each escape in \xe2\x80\x93 do, and why does it need 3 escapes? Trying to decode one of them by itself leads to an 'unexpected end of data' error.

>>> print(b'\xe2\x80\x93'.decode('utf-8'))
–
>>> print(b'\xe2'.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data

Recommended answer

You have UTF-8 bytes. UTF-8 is a codec, a standard for representing text as computer-readable data. The U+2013 EN DASH code point encodes to those 3 bytes when encoded with that codec.
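As a quick check of that claim, the following minimal sketch encodes the en-dash and inspects the result. The variable names are only illustrative.

```python
import unicodedata

# U+2013 written as an escape so the source file stays pure ASCII.
en_dash = "\u2013"

# Encoding the single code point with the UTF-8 codec yields three bytes.
encoded = en_dash.encode("utf-8")

print(unicodedata.name(en_dash))            # EN DASH
print(encoded)                              # b'\xe2\x80\x93'
print(len(encoded))                         # 3
print(encoded.decode("utf-8") == en_dash)   # True: the round trip is lossless
```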

Trying to decode just one such byte as UTF-8 doesn't work because, in the UTF-8 standard, that one byte does not carry meaning on its own. In the UTF-8 encoding scheme, a \xe2 byte is the first byte for all code points between U+2000 and U+2FFF in the Unicode standard (each of which is encoded with 2 additional bytes); that's 4096 code points in all.
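The range covered by the \xe2 lead byte can be checked by brute force. This is only a sketch: it encodes every encodable code point and keeps those whose first UTF-8 byte is 0xE2 (the name lead_e2 is made up for the example). Counting both endpoints, U+2000 through U+2FFF comes to 4096 code points.

```python
# Collect every code point whose UTF-8 encoding starts with the byte 0xE2.
# Surrogates (U+D800..U+DFFF) are skipped because they cannot be encoded.
lead_e2 = [
    cp
    for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF
    and chr(cp).encode("utf-8")[0] == 0xE2
]

print(hex(lead_e2[0]), hex(lead_e2[-1]))  # 0x2000 0x2fff
print(len(lead_e2))                       # 4096

# Every one of them needs exactly two more bytes after the 0xE2.
print(all(len(chr(cp).encode("utf-8")) == 3 for cp in lead_e2))  # True
```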

Python represents the values in a bytes object in a way that lets you reproduce the value by copying it back into a Python script or terminal. Anything that isn't printable ASCII is therefore represented by a \xhh hex escape. The two characters after \x form the hexadecimal value of the byte, an integer between 0 and 255.
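A small illustration of that representation, using an arbitrary sample value chosen for this sketch: printable ASCII bytes appear as themselves, everything else as a \xhh escape, and the printed form can be pasted back in as a literal.

```python
import ast

# 72, 105 and 32 are printable ASCII ("H", "i", space); the rest are not.
data = bytes([72, 105, 32, 226, 128, 147, 0])

text = repr(data)
print(text)  # b'Hi \xe2\x80\x93\x00'

# Iterating over a bytes object yields plain integers between 0 and 255.
print(list(data))  # [72, 105, 32, 226, 128, 147, 0]

# The repr is a valid bytes literal, so it evaluates back to the same value.
print(ast.literal_eval(text) == data)  # True
```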

Hexadecimal is a very helpful way to represent bytes because each of the 2 groups of 4 bits in a byte can be written as one character, a digit in the range 0 - F.
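To make the correspondence between hex digits and groups of 4 bits concrete, here is a short sketch that splits the byte E2 into its two halves (the names high and low are just for illustration).

```python
byte = 0xE2  # decimal 226

# Upper and lower 4 bits of the byte.
high = byte >> 4
low = byte & 0x0F

print(f"{byte:08b}")             # 11100010
print(f"{high:04b} {low:04b}")   # 1110 0010
print(f"{high:X}{low:X}")        # E2: one hex digit per 4-bit group
```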

\xe2\x80\x93 then means there are three bytes, with the hexadecimal values E2, 80 and 93, or 226, 128 and 147 in decimal, respectively. The UTF-8 standard tells a decoder to take the last 4 bits of the first byte, and the last 6 bits of each of the second and third bytes (the remaining bits signal what type of byte you are dealing with, which allows for error handling). Those 4 + 6 + 6 == 16 bits then encode the hex value 2013 (0010 000000 010011 in binary).
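The bit extraction described above can be sketched by hand as follows. This handles only the three-byte case and is not how a real decoder is written; it is meant to show where the 4 + 6 + 6 bits come from.

```python
raw = b"\xe2\x80\x93"
first, second, third = raw  # 226, 128, 147

# Marker bits: 1110 opens a three-byte sequence, 10 marks a continuation byte.
assert first >> 4 == 0b1110
assert second >> 6 == 0b10 and third >> 6 == 0b10

# Keep the last 4 bits of the first byte and the last 6 bits of the others,
# then join them into one 16-bit number.
code_point = ((first & 0x0F) << 12) | ((second & 0x3F) << 6) | (third & 0x3F)

print(f"{code_point:016b}")                       # 0010000000010011
print(hex(code_point))                            # 0x2013
print(chr(code_point) == raw.decode("utf-8"))     # True
```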

You probably want to read up on the difference between codecs (encodings) and Unicode; UTF-8 is a codec that can handle all of the Unicode standard, but it is not the same thing as Unicode. See:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Pragmatic Unicode by Ned Batchelder

Python Unicode HOWTO
