Is the [0xff, 0xfe] prefix required on utf-16 encoded strings?


Question


Rewritten question!

I am working with a vendor's device that requires "unicode encoding" of strings, where each character is represented in two bytes. My strings will always be ASCII based, so I thought this would be the way to translate my string into the vendor's string:

>>> b1 = 'abc'.encode('utf-16')

But examining the result, I see that there's a leading [0xff, 0xfe] on the bytearray:

>>> [hex(b) for b in b1]
['0xff', '0xfe', '0x61', '0x0', '0x62', '0x0', '0x63', '0x0']

Since the vendor's device is not expecting the [0xff, 0xfe], I can strip it off...

>>> b2 = 'abc'.encode('utf-16')[2:]
>>> [hex(b) for b in b2]
['0x61', '0x0', '0x62', '0x0', '0x63', '0x0']

... which is what I want.

But what really surprises me is that I can decode b1 and b2 and they both reconstitute the original string:

>>> b1.decode('utf-16') == b2.decode('utf-16')
True

So my two intertwined questions:

  • What is the significance of the 0xff, 0xfe on the head of the encoded bytes?
  • Is there any hazard in stripping off the 0xff, 0xfe prefix, as with b2 above?

Solution

This observation

... what really surprises me is that I can decode b1 and b2 and they both reconstitute the original string:

b1.decode('utf-16') == b2.decode('utf-16')
True

suggests there is a built-in default, because there are two possible arrangements for the 16-bit wide UTF-16 code units: big-endian and little-endian.
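A quick sketch (plain CPython, no extra assumptions) makes the BOM's role visible: the same character encoded in either byte order decodes identically, because the leading BOM tells the decoder which arrangement follows:

```python
# U+FEFF (the BOM) serialized little-endian is FF FE; big-endian is FE FF.
le = b'\xff\xfe' + b'a\x00'   # little-endian: BOM + 'a'
be = b'\xfe\xff' + b'\x00a'   # big-endian:    BOM + 'a'

# The generic 'utf-16' decoder reads the BOM, picks the byte order
# accordingly, and drops the BOM from the result:
assert le.decode('utf-16') == 'a'
assert be.decode('utf-16') == 'a'
```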

Normally, Python deduces the endianness to use from the BOM when reading – and so it always adds one when writing. If you want to force a specific endianness, you can use the explicit encodings utf-16-le and utf-16-be:

… when such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.
(https://docs.python.org/3/howto/unicode.html#reading-and-writing-unicode-data)
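For the vendor scenario above, this means there is no need to encode and then slice the BOM off: the explicit little-endian codec produces the expected byte layout directly (a sketch, with 'abc' standing in for your ASCII-based strings):

```python
# utf-16-le specifies a fixed byte order, so no BOM is prepended:
b2 = 'abc'.encode('utf-16-le')
assert [hex(b) for b in b2] == ['0x61', '0x0', '0x62', '0x0', '0x63', '0x0']

# Decoding with the matching explicit codec round-trips cleanly:
assert b2.decode('utf-16-le') == 'abc'
```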

But if you do not use a specific ordering, then what default gets used? The original Unicode proposal, PEP 100, warns

Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.
(https://www.python.org/dev/peps/pep-0100/, my emph.)

Yet it works for you. If we look up in the Python source code how this is managed, we find this comment in _codecsmodule.c:

/* This version provides access to the byteorder parameter of the
   builtin UTF-16 codecs as optional third argument. It defaults to 0
   which means: use the native byte order and prepend the data with a
   BOM mark.
*/

and deeper, in unicodeobject.c,

/* Check for BOM marks (U+FEFF) in the input and adjust current
   byte order setting accordingly. In native mode, the leading BOM
   mark is skipped, in all other modes, it is copied to the output
   stream as-is (giving a ZWNBSP character). */

So initially, the byte order is set to your system's default, and once a BOM is encountered while decoding UTF-16 data, the byte order switches to whatever the BOM specifies. "Native mode" in that last comment means that no byte order has been explicitly declared and none has yet been established by a BOM; only then does the decoder fall back on your system's endianness.
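This is also where the hazard of stripping the BOM becomes concrete: BOM-less bytes decoded with the generic 'utf-16' codec depend on the native byte order of whatever machine reads them. A sketch of what a peer with the opposite assumption would see (the explicit codecs simulate the two machines):

```python
raw = 'abc'.encode('utf-16-le')   # BOM-less little-endian bytes

# Read with the intended byte order, all is well:
assert raw.decode('utf-16-le') == 'abc'

# Read with the opposite byte order, the same bytes silently decode
# to entirely different characters (U+6100 U+6200 U+6300) -- no error:
assert raw.decode('utf-16-be') == '\u6100\u6200\u6300'
```

As long as both ends agree on the byte order out of band, as your vendor's device evidently does, stripping the BOM (or better, using utf-16-le directly) is safe.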
