bytes() initializer adding an additional byte?


Problem description

I initialize a UTF-8 encoded string in Python 3:

bytes('\xc2', encoding="utf-8", errors="strict")

but on writing it out I get two bytes!

>>> s = bytes('\xc2', encoding="utf-8", errors="strict")
>>> s
b'\xc3\x82'

Where is this additional byte coming from? Why should I not be able to encode any hex value up to 254 (I can understand that 255 is potentially reserved to extend to utf-16)?

Solution

The Unicode code point U+00C2 (written "\xc2" in a str literal, or as the character "Â") is two bytes long when encoded as UTF-8, because UTF-8 uses a single byte only for code points below U+0080. If you were expecting the single byte b'\xc2', you probably want a different encoding, such as "latin-1":

>>> s = bytes("\xc2", encoding="latin-1", errors="strict")
>>> s
b'\xc2'
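
To make the difference concrete, here is a minimal sketch that encodes the same one-character str both ways. The variable names are illustrative only and are not part of the original question or answer.

```python
# Compare how one str encodes under UTF-8 and Latin-1.
text = "\xc2"  # a str holding a single code point, U+00C2

utf8_bytes = text.encode("utf-8")      # same result as bytes(text, encoding="utf-8")
latin1_bytes = text.encode("latin-1")  # Latin-1 maps U+0000..U+00FF to one byte each

print(len(text), hex(ord(text)))  # 1 0xc2
print(utf8_bytes)                 # b'\xc3\x82'
print(latin1_bytes)               # b'\xc2'

# UTF-8 switches from one byte to two at U+0080, not at 255.
last_one_byte = "\x7f".encode("utf-8")
first_two_byte = "\x80".encode("utf-8")
print(last_one_byte, first_two_byte)  # b'\x7f' b'\xc2\x80'
```

The last two lines suggest why values up to 254 cannot all be single bytes in UTF-8: only the ASCII range (0 to 127) gets a one-byte encoding.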

If you are really creating "\xc2" directly from a literal, though, there's no need to go through the bytes constructor to turn it into a bytes instance. Just use the b prefix on the literal to create the bytes directly:

s = b"\xc2"
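
As a rough check, the sketch below shows that the bytes literal matches the Latin-1 result, and that the lone byte 0xC2 is not valid UTF-8 by itself. The names used here are hypothetical and chosen for illustration.

```python
# Three equivalent ways to obtain the single byte 0xC2.
from_literal = b"\xc2"
from_ints = bytes([0xC2])               # bytes() also accepts an iterable of ints
from_latin1 = "\xc2".encode("latin-1")
print(from_literal == from_ints == from_latin1)  # True

# 0xC2 alone is only the lead byte of a two-byte UTF-8 sequence,
# so decoding it as UTF-8 raises UnicodeDecodeError.
try:
    from_literal.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False

# Decoding as Latin-1 round-trips back to the original one-character str.
print(from_literal.decode("latin-1") == "\xc2")  # True
```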
