Python 3: Demystifying encode and decode methods

Question

Let's say I have a string in Python:

>>> s = 'python'
>>> len(s)
6

Now I encode this string like this:

>>> b = s.encode('utf-8')
>>> b16 = s.encode('utf-16')
>>> b32 = s.encode('utf-32')

What I get from the above operations is a bytes array; that is, b, b16 and b32 are just arrays of bytes (each byte being 8 bits long, of course).

But we encoded the string. So, what does this mean? How do we attach the notion of "encoding" with the raw array of bytes?

The answer lies in the fact that each of these arrays of bytes is generated in a particular way. Let's look at these arrays:

>>> [hex(x) for x in b]
['0x70', '0x79', '0x74', '0x68', '0x6f', '0x6e']

>>> len(b)
6

This array indicates that for each character we have one byte (because all the characters fall below 127). Hence, we can say that "encoding" the string to 'utf-8' collects each character's corresponding code point and puts it into the array. If the code point cannot fit in one byte then utf-8 consumes two bytes. Hence utf-8 consumes the least number of bytes possible.

>>> [hex(x) for x in b16]
['0xff', '0xfe', '0x70', '0x0', '0x79', '0x0', '0x74', '0x0', '0x68', '0x0', '0x6f', '0x0', '0x6e', '0x0']

>>> len(b16)
14     # (2 + 6*2)

Here we can see that "encoding to utf-16" first puts a two byte BOM (FF FE) into the bytes array, and after that, for each character it puts two bytes into the array. (In our case, the second byte is always zero)

>>> [hex(x) for x in b32]
['0xff', '0xfe', '0x0', '0x0', '0x70', '0x0', '0x0', '0x0', '0x79', '0x0', '0x0', '0x0', '0x74', '0x0', '0x0', '0x0', '0x68', '0x0', '0x0', '0x0', '0x6f', '0x0', '0x0', '0x0', '0x6e', '0x0', '0x0', '0x0']

>>> len(b32)
28     # (2+ 6*4 + 2)

In the case of "encoding in utf-32", we first put the BOM, then for each character we put four bytes, and lastly we put two zero bytes into the array.

Hence, we can say that the "encoding process" collects 1, 2 or 4 bytes (depending on the encoding name) for each character in the string, and prepends and appends more bytes to them to create the final result array of bytes.

Now, my questions:

  • Is my understanding of the encoding process correct or am I missing something?
  • We can see that the memory representation of the variables b, b16 and b32 is actually a list of bytes. What is the memory representation of the string? Exactly what is stored in memory for a string?
  • We know that when we do an encode(), each character's corresponding code point is collected (code point corresponding to the encoding name) and put into an array or bytes. What exactly happens when we do a decode()?
  • We can see that in utf-16 and utf-32, a BOM is prepended, but why are two zero bytes appended in the utf-32 encoding?

Answer

First of all, UTF-32 is a 4-byte encoding, so its BOM is a four-byte sequence too:

>>> import codecs
>>> codecs.BOM_UTF32
b'\xff\xfe\x00\x00'

And because different computer architectures treat byte orders differently (called endianness), there are two variants of the BOM, little and big endian:

>>> codecs.BOM_UTF32_LE
b'\xff\xfe\x00\x00'
>>> codecs.BOM_UTF32_BE
b'\x00\x00\xfe\xff'

The purpose of the BOM is to communicate that order to the decoder; read the BOM and you know whether the data is big or little endian. So those last two null bytes in your UTF-32 string are not padding; they are part of the last encoded character. The length works out as 4 (BOM) + 6*4 = 28.
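The byte accounting for the UTF-32 case can be checked directly. This is a minimal sketch using only the standard library; the variable names (`bom`, `payload`, `last`) are just illustrative:

```python
import codecs

s = 'python'
b32 = s.encode('utf-32')

# The generic 'utf-32' codec writes a 4-byte BOM in the platform's byte order.
bom = b32[:4]
assert bom in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

# Everything after the BOM is the text itself: 6 characters * 4 bytes each.
payload = b32[4:]
print(len(b32), len(bom), len(payload))

# The final four bytes are the code point of 'n', not trailing padding.
byteorder = 'little' if bom == codecs.BOM_UTF32_LE else 'big'
last = int.from_bytes(payload[-4:], byteorder)
print(hex(last), chr(last))

# Decoding with the generic codec reads the BOM, picks the byte order,
# and drops the BOM from the resulting string.
print(b32.decode('utf-32'))
```

This also covers what decode() does in practice: it interprets the bytes according to the named encoding (using the BOM, if present, to choose the byte order) and hands back the code points as a str.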

The UTF-16 BOM is thus similar, in that there are two variants:

>>> codecs.BOM_UTF16
b'\xff\xfe'
>>> codecs.BOM_UTF16_LE
b'\xff\xfe'
>>> codecs.BOM_UTF16_BE
b'\xfe\xff'

It depends on your computer architecture which one is used by default.
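If you would rather not depend on the platform default, you can name the byte order explicitly. The sketch below assumes nothing beyond the standard `codecs` and `sys` modules; the explicit `utf-16-le` and `utf-16-be` codecs write no BOM at all:

```python
import codecs
import sys

s = 'python'

# codecs.BOM_UTF16 is an alias for whichever variant matches this machine.
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == 'little' else codecs.BOM_UTF16_BE
assert codecs.BOM_UTF16 == native_bom

# The generic codec prepends the native BOM ...
generic = s.encode('utf-16')
assert generic.startswith(native_bom)

# ... while the explicit codecs fix the byte order and omit the BOM.
le = s.encode('utf-16-le')
be = s.encode('utf-16-be')
print(len(generic), len(le), len(be))  # 14 12 12
print(le[:2], be[:2])                  # b'p\x00' b'\x00p'
```

With an explicit codec, the decoder has to be told the same byte order, since there is no BOM left in the data to announce it.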

UTF-8 doesn't need a BOM at all; UTF-8 uses 1 to 4 bytes per character (adding bytes as needed to encode higher code points), but the order of those bytes is defined in the standard. Microsoft deemed it necessary to introduce a UTF-8 BOM anyway (so its Notepad application could detect UTF-8), but since the byte order of UTF-8 never varies, its use is discouraged.
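To see the variable width for yourself, here is a small sketch. The sample characters are arbitrary picks, one from each width class, and `utf-8-sig` is the standard-library codec that writes and strips the UTF-8 BOM:

```python
import codecs

# One sample per UTF-8 width: ASCII, Latin-1 range, rest of the BMP, beyond the BMP.
samples = ['p', '\u00e9', '\u20ac', '\U0001f600']
widths = []
for ch in samples:
    encoded = ch.encode('utf-8')
    widths.append(len(encoded))
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}')

# The plain 'utf-8' codec never writes a BOM; 'utf-8-sig' prepends one.
plain = 'python'.encode('utf-8')
with_sig = 'python'.encode('utf-8-sig')
assert with_sig == codecs.BOM_UTF8 + plain

# 'utf-8-sig' strips the BOM on decode; plain 'utf-8' leaves it in as U+FEFF.
print(repr(with_sig.decode('utf-8-sig')))  # 'python'
print(repr(with_sig.decode('utf-8')))      # '\ufeffpython'
```

So if you have to read files that may have been saved by Notepad, decoding with `utf-8-sig` is the safer choice; it handles input with or without the BOM.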

As for what is stored by Python for unicode strings: that actually changed in Python 3.3. Before 3.3, internally at the C level, Python stored either UTF-16 or UTF-32 code units, depending on whether or not Python was compiled with wide character support (see How to find out if Python is compiled with UCS-2 or UCS-4?; UCS-2 is essentially UTF-16 and UCS-4 is UTF-32). So each character took either 2 or 4 bytes of memory.
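The usual way to tell the two builds apart was to look at `sys.maxunicode`. The helper below is a hypothetical name of my own, not something from the standard library, and note that on Python 3.3 and later `sys.maxunicode` is always 0x10FFFF, so the check only tells you something on older interpreters:

```python
import sys


def describe_build(maxunicode=sys.maxunicode):
    """Classify a pre-3.3 build from its sys.maxunicode value.

    Hypothetical helper for illustration: narrow builds reported 0xFFFF,
    wide builds reported 0x10FFFF.
    """
    return 'wide (UCS-4)' if maxunicode > 0xFFFF else 'narrow (UCS-2)'


print(hex(sys.maxunicode), describe_build())
```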

As of Python 3.3, the internal representation uses the minimal number of bytes per character required to represent all characters in the string. For plain ASCII and Latin1-encodable text 1 byte per character is used, for the rest of the BMP 2 bytes are used, and for text containing characters beyond that 4 bytes are used. Python switches between the formats as needed. Thus, storage has become a lot more efficient for most cases. For more detail see What's New in Python 3.3.
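You can observe this from Python itself with `sys.getsizeof`. Treat the following as a rough sketch that assumes CPython 3.3 or later (this is an implementation detail, and other interpreters may report something else). `bytes_per_char` is a made-up helper that compares two strings of different lengths so the fixed object header cancels out:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing strings of length 2n and n.

    Hypothetical helper; relies on CPython's sys.getsizeof reporting the
    string's internal buffer, which is an implementation detail.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


samples = {
    'ASCII': 'a',
    'Latin-1': '\u00e9',
    'BMP': '\u20ac',
    'beyond BMP': '\U0001f600',
}
per_char = {label: bytes_per_char(ch) for label, ch in samples.items()}
for label, width in per_char.items():
    print(f'{label}: {width} byte(s) per character')
```

On CPython this should report 1, 1, 2 and 4 bytes. The width is chosen per string, so a single character from beyond the BMP makes every character in that string take 4 bytes.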

I can strongly recommend you read up on Unicode and Python with:
