In UTF-16, UTF-16BE, UTF-16LE, is the endian of UTF-16 the computer's endianness?

Question

UTF-16 encodes text in two-byte code units. Swapping the two bytes of each code unit gives UTF-16BE and UTF-16LE.

But I find that the Ubuntu gedit text editor offers an encoding named UTF-16, as well as UTF-16BE and UTF-16LE. With a C test program I found that my computer is little endian, and I confirmed that UTF-16 there produces the same encoding as UTF-16LE.
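A minimal Python sketch of that kind of check (Python stands in here for the C test program, which the post does not show). It reads the machine's byte order from `sys.byteorder` and compares the output of Python's own `utf-16` codec with the explicit variants. This only illustrates Python's codec; gedit may behave differently.

```python
import codecs
import sys

text = "A"

# Explicit variants: the same 16-bit code unit 0x0041, with its bytes swapped.
big = text.encode("utf-16-be")      # b'\x00A'
little = text.encode("utf-16-le")   # b'A\x00'

# Python's plain "utf-16" encoder writes a BOM followed by native-order data.
plain = text.encode("utf-16")

if sys.byteorder == "little":
    native_bom, native_payload = codecs.BOM_UTF16_LE, little
else:
    native_bom, native_payload = codecs.BOM_UTF16_BE, big

print("machine byte order:", sys.byteorder)
print("plain utf-16 starts with native BOM:", plain.startswith(native_bom))
print("payload matches native variant:", plain[len(native_bom):] == native_payload)
```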

Also, a multi-byte value (such as an integer) can be stored in either of two byte orders, depending on whether the computer is little or big endian. Little endian computers produce little endian values in hardware (except for values written by Java, which always uses big endian).

Text can be saved as UTF-16LE as well as UTF-16BE on my little endian computer. So are the characters produced one byte at a time (like an ASCII string, see reference [3]), with the endianness of UTF-16 simply defined by convention, rather than as a result of big endian machines writing big endian UTF-16 and little endian machines writing little endian UTF-16?


  1. http://www.ibm.com/developerworks/aix/library/au-endianc/
  2. http://teaching.idallen.com/cst8281/10w/notes/110_byte_order_endian.html
  3. ASCII strings and endianness
  4. Is it true that endianness only affects the memory layout of numbers, but not strings? (A post on the relation between the endianness of strings and of the machine.)


Recommended answer

"is endian of UTF-16 the computer's endianness?"

The impact of your computer's endianness can be looked at from the point of view of a writer or a reader of a file.

If you are reading a file in a -standard- format, then the kind of machine reading it shouldn't matter. The format should be well-defined enough that no matter what the endianness of the reading machine is, the data can still be read correctly.

That doesn't mean the format can't be flexible. With "UTF-16" (when a "BE" or "LE" disambiguation is not used in the format name) the definition allows files to be marked as either big endian or little endian. This is done with something called the "Byte Order Mark" (BOM) in the first two bytes of the file:

https://en.wikipedia.org/wiki/Byte_order_mark
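As a rough illustration of how a reader might inspect those first two bytes, here is a small Python sketch. The function name `utf16_bom_byte_order` is hypothetical; only the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants come from the standard library.

```python
import codecs


def utf16_bom_byte_order(data: bytes):
    """Return 'big' or 'little' if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little"
    # No BOM: the bytes alone do not say which order was used.
    return None


print(utf16_bom_byte_order(b"\xfe\xff\x00A"))
print(utf16_bom_byte_order(b"\xff\xfeA\x00"))
print(utf16_bom_byte_order(b"\x00A"))
```

The sketch only considers UTF-16. The bytes FF FE 00 00 are also the UTF-32LE BOM, so a real detector would likely need to check for that case first.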

The existence of the BOM gives options to the writer of a file. They might choose to write out the most natural endianness for a buffer in memory, and include a BOM that matched. This wouldn't necessarily be the most efficient format for some other reader. But any program claiming UTF-16 support is supposed to be able to handle it either way.

So yes--the computer's endianness might factor into the endianness choice of a BOM-marked UTF-16 file. Still...a little-endian program is fully able to save a file, label it "UTF-16" and have it be big-endian. As long as the BOM is consistent with the data, it doesn't matter what kind of machine writes or reads it.
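A short Python sketch of that point, assuming Python's built-in codecs as the writer and reader (the sample text is made up). The writer picks big-endian explicitly, whatever the machine's own byte order is, and a BOM-aware reader still recovers the text.

```python
import codecs

text = "h\u00e9llo"

# Writer: explicitly choose big-endian data and prepend the matching BOM,
# regardless of this machine's native byte order.
data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# Reader: Python's "utf-16" decoder looks at the BOM, picks the byte order
# it indicates, and drops the BOM from the result.
decoded = data.decode("utf-16")

print(data[:2].hex())      # the BOM bytes
print(decoded == text)
```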

...what if there's no BOM?

This is where things get a little hazy.

On the one hand, RFC 2781 and the Unicode FAQ are clear. They say that a file in "UTF-16" format which starts with neither 0xFF 0xFE nor 0xFE 0xFF is to be interpreted as big endian:


the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used.
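One way to read that rule, sketched in Python. The helper name `decode_utf16_standard` is hypothetical, and the sketch assumes no higher-level protocol supplies the byte order: use the BOM if there is one, otherwise fall back to big-endian.

```python
import codecs


def decode_utf16_standard(data: bytes) -> str:
    """Decode bytes labeled "UTF-16": honor a BOM, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Unmarked form: big-endian by default.
    return data.decode("utf-16-be")


print(decode_utf16_standard(b"\x00h\x00i"))          # no BOM, big-endian data
print(decode_utf16_standard(b"\xff\xfeh\x00i\x00"))  # little-endian BOM
```

Note that this deliberately differs from decoders that fall back to the machine's native byte order when no BOM is present.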

Yet to know whether you have a UTF-16LE, UTF-16BE, or UTF-16 file with no BOM, you need metadata outside the file telling you which of the three it is. Because there's not always a place to put that metadata, some programs wound up using heuristics.

Consider something like this from Raymond Chen (2007):


You might decide that programs that generate UTF-16 files without a BOM are broken, but that doesn't mean that they don't exist. For example,

cmd /u /c dir >results.txt

This generates a UTF-16LE file without a BOM.

That's a valid UTF-16LE file, but where would the "UTF-16LE" meta-label be stored? What are the odds someone passes that off by just calling it a UTF-16 file?
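To see why the label matters, here is a small Python sketch that simulates BOM-less little-endian bytes and decodes them both ways. The sample string is made up and is not real `dir` output.

```python
sample = "dir"

# Little-endian UTF-16 with no BOM: bytes 64 00 69 00 72 00.
raw = sample.encode("utf-16-le")

# Read with the correct label, the text round-trips.
as_le = raw.decode("utf-16-le")

# Read as big-endian (the standard's default for unmarked "UTF-16"),
# each byte pair is swapped: U+6400, U+6900, U+7200.
as_be = raw.decode("utf-16-be")

print(as_le)
print([hex(ord(ch)) for ch in as_be])
```

Both decodes succeed without an error, so a wrong guess shows up as unrelated characters and not as a failure.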

Empirically there are warnings about the term. The Wikipedia page for UTF-16 says:


If the BOM is missing, RFC 2781 says that big-endian encoding should be assumed. (In practice, due to Windows using little-endian order by default, many applications similarly assume little-endian encoding by default.)

And unicode.readthedocs.org says:

"UTF-16" and "UTF-32" encoding names are imprecise: depending of the context, format or protocol, it means UTF-16 and UTF-32 with BOM markers, or UTF-16 and UTF-32 in the host endian without BOM. On Windows, "UTF-16" usually means UTF-16-LE.

Additionally, the Byte Order Mark Wikipedia article says:


Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian."

Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored.

When those same files are accessible on the Internet, on the other hand, no such presumption can be made. Searching for 16-bit characters in the ASCII range or just the space character (U+0020) is a method of determining the UTF-16 byte order.
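The heuristic in that last sentence could be sketched roughly as follows in Python. The function name `guess_utf16_byte_order` and the simple majority rule are assumptions made for illustration, not something the quoted article specifies.

```python
def guess_utf16_byte_order(data: bytes):
    """Guess 'big', 'little', or None by counting ASCII-range code units."""
    be_hits = 0
    le_hits = 0
    # Walk complete 2-byte code units only.
    for i in range(0, len(data) - 1, 2):
        first, second = data[i], data[i + 1]
        # An ASCII-range code unit has a zero high byte and a low byte < 0x80.
        if first == 0 and 0 < second < 0x80:
            be_hits += 1   # high byte comes first: looks big-endian
        if second == 0 and 0 < first < 0x80:
            le_hits += 1   # high byte comes second: looks little-endian
    if be_hits > le_hits:
        return "big"
    if le_hits > be_hits:
        return "little"
    return None            # no evidence either way


print(guess_utf16_byte_order("hello world".encode("utf-16-le")))
print(guess_utf16_byte_order("hello world".encode("utf-16-be")))
```

This is only a guess: text with few or no ASCII-range characters gives the heuristic little to work with, so it should not override a BOM or an explicit label.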

So despite the unambiguity of the standard, the context may matter in practice.

As @rici points out, the standard has been around for a while now. Still, it may pay to do double-checks on files claimed as "UTF-16". Or even to consider if you might want to avoid a lot of these issues and embrace UTF-8...

Should UTF-16 be considered harmful?
