bytes() initializer adding an additional byte?
I initialize a UTF-8 encoded string in Python 3:
bytes('\xc2', encoding="utf-8", errors="strict")
but on writing it out I get two bytes!
>>> s = bytes('\xc2', encoding="utf-8", errors="strict")
>>> s
b'\xc3\x82'
Where is this additional byte coming from? Why should I not be able to encode any hex value up to 254 (I can understand that 255 is potentially reserved to extend to utf-16)?
The Unicode code point "\xc2" (which can also be written as "Â") is two bytes long when encoded with the utf-8 encoding. If you were expecting it to be the single byte b'\xc2', you probably want to use a different encoding, such as "latin-1":
>>> s = bytes("\xc2", encoding="latin-1", errors="strict")
>>> s
b'\xc2'
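To see where the extra byte comes from, here is a minimal sketch comparing the two encodings (the variable names are illustrative and not from the question). UTF-8 uses a single byte only for code points up to U+007F; code points from U+0080 through U+07FF take two bytes. latin-1 maps code points 0 through 255 directly to single bytes.

```python
# Compare how UTF-8 and latin-1 encode the same code point.
ch = "\xc2"  # the code point U+00C2

utf8 = ch.encode("utf-8")
latin1 = ch.encode("latin-1")

print(hex(ord(ch)))         # 0xc2
print(utf8, len(utf8))      # b'\xc3\x82' 2
print(latin1, len(latin1))  # b'\xc2' 1

# UTF-8 encoded length for every code point from 0 to 255.
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in range(256)}
single_byte = [cp for cp, n in lengths.items() if n == 1]
print(max(single_byte))     # 127: only ASCII fits in one UTF-8 byte
```

So the limit for single-byte UTF-8 is 127, not 254: every code point from 128 to 255 needs two bytes.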
If you are really creating "\xc2" directly with a literal though, there's no need to mess around with the bytes constructor to turn it into a bytes instance. Just use the b prefix on the literal to create the bytes directly:
s = b"\xc2"
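As a quick check, the sketch below shows a few standard-library ways to build the same single byte and confirms they agree with the literal. It also shows that this byte on its own is not valid UTF-8, which is why it cannot be the UTF-8 encoding of any character. The variable names are illustrative.

```python
# Several equivalent ways to get the single byte 0xC2.
from_literal = b"\xc2"
from_int_list = bytes([0xC2])
from_hex = bytes.fromhex("c2")
from_latin1 = "\xc2".encode("latin-1")

same = from_literal == from_int_list == from_hex == from_latin1
print(same)  # True

# A lone 0xC2 is a UTF-8 lead byte with no continuation byte,
# so decoding it as UTF-8 fails.
try:
    from_literal.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False
```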