String to Bytes Python without change in encoding


Problem description

I have this issue and I can't figure out how to solve it. I have this string:

data = '\xc4\xb7\x86\x17\xcd'

When I try to encode it:

data.encode()

I get this result:

b'\xc3\x84\xc2\xb7\xc2\x86\x17\xc3\x8d'

I only want:

b'\xc4\xb7\x86\x17\xcd'

Does anyone know the reason for this and how to fix it? The string is already stored in a variable, so I can't add the b prefix in front of a literal.

Solution

You cannot convert a string into bytes, or bytes into a string, without taking an encoding into account. The whole point of the bytes type is that it is an encoding-independent sequence of bytes, while str is a sequence of Unicode code points, which by design have no unique byte representation.

So when you want to convert one into the other, you must tell explicitly what encoding you want to use to perform this conversion. When converting into bytes, you have to say how to represent each character as a byte sequence; and when you convert from bytes, you have to say what method to use to map those bytes into characters.
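As a small illustration of the point above, the sketch below encodes the same one-character string with two different encodings and decodes it again. The variable names are only illustrative and are not part of the original question.

```python
# One Unicode code point, U+00C4 (the character "Ä").
text = "\xc4"

# The same character has different byte representations per encoding.
as_utf8 = text.encode("utf-8")      # two bytes
as_latin1 = text.encode("latin-1")  # one byte

# Decoding with the encoding that produced the bytes restores the text.
assert as_utf8.decode("utf-8") == text
assert as_latin1.decode("latin-1") == text

# Decoding with a different encoding does not fail here, but it yields
# different characters: each UTF-8 byte becomes its own code point.
misread = as_utf8.decode("latin-1")
print(len(text), len(misread))
```

Decoding with the wrong codec is exactly what produces the "doubled" bytes in the question when the result is encoded again.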

If you don’t specify the encoding, then UTF-8 is the default, which is a sane default since UTF-8 is ubiquitous, but it's also just one of many valid encodings.

Take your original string, '\xc4\xb7\x86\x17\xcd', and look at which Unicode code points its characters represent. \xc4, for example, is LATIN CAPITAL LETTER A WITH DIAERESIS, i.e. Ä. That character happens to be encoded in UTF-8 as 0xC3 0x84, which explains why that is what you get when you encode it into bytes. In UTF-16, for example, the same character is encoded as 0x00C4 instead.
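To make this concrete, here is a short sketch that inspects the code points in the string from the question and compares a few encodings of its first character. It uses only the standard library; the variable names are illustrative.

```python
import unicodedata

data = '\xc4\xb7\x86\x17\xcd'

# Each character in the str is a Unicode code point, not a byte.
code_points = [hex(ord(ch)) for ch in data]
print(code_points)

# The first character is U+00C4.
first = data[0]
first_name = unicodedata.name(first)
print(first_name)

# The same code point has a different byte form in each encoding.
utf8_bytes = first.encode('utf-8')
utf16_bytes = first.encode('utf-16-be')  # big-endian, no byte order mark
latin1_bytes = first.encode('latin-1')
print(utf8_bytes, utf16_bytes, latin1_bytes)

# Encoding the whole string as UTF-8 reproduces the result from the question.
utf8_all = data.encode('utf-8')
print(utf8_all)
```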


As for how to solve this properly so that you get the desired output, there is no single correct answer. The solution that Kasramvd mentioned, the raw_unicode_escape codec, is also somewhat imperfect. This is what the documentation says about it:

raw_unicode_escape

Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.

So this is just a Latin-1 encoding which has a built-in fallback for characters outside of it. I would consider this fallback somewhat harmful for your purpose. For Unicode characters that cannot be represented as a \xXX sequence, this might be problematic:

>>> chr(256).encode('raw_unicode_escape')
b'\\u0100'

Code point 256 is just outside the Latin-1 range (0 to 255), which causes the raw_unicode_escape encoding to instead return the encoded bytes for the string '\\u0100', turning that one character into 6 bytes which have little to do with the original character (since it's an escape sequence).
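One way to see why this fallback can be a problem is the sketch below: because existing backslashes are not escaped, a single character outside Latin-1 and a six-character string that merely looks like an escape sequence produce the same bytes, so the result cannot be mapped back unambiguously. The variable names are illustrative.

```python
# A single character outside the Latin-1 range.
one_char = chr(256)

# A six-character string: backslash, 'u', '0', '1', '0', '0'.
six_chars = '\\u0100'

encoded_one = one_char.encode('raw_unicode_escape')
encoded_six = six_chars.encode('raw_unicode_escape')

# Two different inputs produce identical output.
print(encoded_one == encoded_six, len(encoded_one))

# Decoding therefore cannot restore the six-character string.
decoded = encoded_six.decode('raw_unicode_escape')
print(decoded == six_chars, decoded == one_char)
```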

So if you want to use Latin-1 here, I would suggest you use it explicitly, without the escape-sequence fallback of raw_unicode_escape. This will simply raise an exception when you try to convert code points outside the Latin-1 range:

>>> '\xc4\xb7\x86\x17\xcd'.encode('latin1')
b'\xc4\xb7\x86\x17\xcd'
>>> chr(256).encode('latin1')
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    chr(256).encode('latin1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)
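If the conversion has to happen in several places, it could be wrapped in a small helper. The function below is a hypothetical sketch, not part of the original answer: the name str_to_bytes_latin1 and the choice to re-raise as ValueError are assumptions made for illustration.

```python
def str_to_bytes_latin1(text: str) -> bytes:
    """Map each code point 0..255 to the byte with the same value.

    Hypothetical helper: raises ValueError instead of silently
    substituting an escape sequence for characters above U+00FF.
    """
    try:
        return text.encode('latin-1')
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that failed.
        bad = text[exc.start]
        raise ValueError(
            f'character {bad!r} at index {exc.start} is not a single byte value'
        ) from exc


data = '\xc4\xb7\x86\x17\xcd'
result = str_to_bytes_latin1(data)
print(result)

# A character outside Latin-1 is rejected rather than escaped.
try:
    str_to_bytes_latin1('ok' + chr(256))
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Failing loudly keeps a stray non-Latin-1 character from turning into unrelated bytes further down the line.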

Of course, whether code points outside the Latin-1 range can cause problems for you depends on where that string actually comes from. But if you can guarantee that the input will only contain valid Latin-1 characters, then chances are that you don't really need to be working with a string there in the first place. Since you are actually dealing with some kind of bytes, you should check whether you can simply retrieve those values as bytes to begin with. That way you won't introduce two levels of encoding, where you can corrupt data by misinterpreting the input.
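Where the data really comes from is not stated in the question, so the following is only a sketch under the assumption that it is read from a file. It writes the five bytes to a temporary file and then compares reading them back in binary mode with reading them as Latin-1 text and converting afterwards.

```python
import os
import tempfile

payload = b'\xc4\xb7\x86\x17\xcd'

# Assumption for illustration: the data lives in a file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode returns bytes directly; no encoding is involved.
    with open(path, 'rb') as fh:
        raw = fh.read()

    # Text mode decodes first, so a second conversion is needed
    # to get back to bytes, and both steps must agree on the codec.
    with open(path, 'r', encoding='latin-1', newline='') as fh:
        as_text = fh.read()
    round_tripped = as_text.encode('latin-1')
finally:
    os.remove(path)

print(raw == payload, round_tripped == payload)
```

Both paths give the same result here, but only the binary read avoids having to pick a codec at all.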
