Why do we have to specify a BOM for the UTF-16 and UTF-32 encodings?


Question

I don't quite understand the principles behind the UTF encodings and the BOM.

What is the point of having a BOM in UTF-16 and UTF-32 if computers already know how to compose multibyte data types (for example, 4-byte integers) into one variable? Why do we need to specify it explicitly for these encodings?

And why don't we need to specify it for UTF-8? The Unicode standard says that UTF-8 is "byte oriented", but even then we need to know whether a given byte is the first byte of an encoded code point or not. Or is that specified in the first / last bits of every character?

Recommended answer

A UTF-16 code unit is two bytes wide; let's call those bytes B0|B1. Take the letter 'a', which is logically the number 0x0061. Unfortunately, different computer architectures store this number in memory in different ways. On x86 the less significant byte is stored first (at the lower memory address), so 'a' is stored as 61|00. On a big-endian PowerPC the more significant byte is stored first, so 'a' is stored as 00|61. These two conventions are called little endian and big endian, respectively.
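The two layouts can be checked directly. This is a minimal Python sketch, not tied to any particular machine: the codec names `utf-16-le` and `utf-16-be` fix the byte order explicitly, and the variable names `little` and `big` are only illustrative.

```python
import struct
import sys

# 'a' is code point U+0061; as one 16-bit code unit it is the number 0x0061.
little = "a".encode("utf-16-le")  # less significant byte first (x86 style)
big = "a".encode("utf-16-be")     # more significant byte first (big-endian PowerPC style)

print(little.hex(" "))  # 61 00
print(big.hex(" "))     # 00 61

# Packing the plain 16-bit integer 0x0061 gives the same two layouts.
assert struct.pack("<H", 0x0061) == little
assert struct.pack(">H", 0x0061) == big

# Byte order of the machine running this script: "little" or "big".
print(sys.byteorder)
```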

To speed up string processing, libraries generally store two-byte characters in the machine's native byte order (big endian or little endian). Swapping bytes on every access would be too expensive.

Now imagine that someone on PowerPC writes a string to a file: the library writes the bytes 00|61. Someone on x86 then wants to read those bytes, but do they mean 0x0061 or 0x6100? To settle this, a special sequence, the byte order mark (BOM, code point U+FEFF), can be put at the beginning of the string so that any reader knows which byte order was used to save it and can process it correctly. (Converting a string between byte orders is a costly operation, but most of the time an x86 string is read on an x86 machine and a PowerPC string on a PowerPC machine.)
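A reader can use the BOM roughly as follows. This is a sketch rather than a full decoder: `decode_utf16` is a hypothetical helper, the fallback to big endian when no BOM is present is an assumption made for this example, and the `codecs.BOM_UTF16_BE` / `codecs.BOM_UTF16_LE` constants come from Python's standard library.

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    # No BOM: assumption for this sketch, treat the data as big endian.
    return data.decode("utf-16-be")

# The same text written by a big-endian and by a little-endian machine.
from_big_endian = codecs.BOM_UTF16_BE + "a".encode("utf-16-be")
from_little_endian = codecs.BOM_UTF16_LE + "a".encode("utf-16-le")

print(decode_utf16(from_big_endian))     # a
print(decode_utf16(from_little_endian))  # a

# Without a BOM, guessing the wrong order turns 0x0061 into 0x6100.
misread = b"\x00\x61".decode("utf-16-le")
print(hex(ord(misread)))  # 0x6100
```

Python's own `utf-16` codec (without the `-le` / `-be` suffix) performs the same BOM check when decoding, so the helper above is only meant to make the mechanism visible.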

With UTF-8 it is a different story. UTF-8 is a sequence of single bytes in one fixed order, and it encodes the length of each character in the leading bits of that character's first byte; every following (continuation) byte starts with the bits 10, so a reader can always tell whether a byte begins a character. The UTF-8 encoding is well described on Wikipedia. Generally speaking, it was designed to avoid the endianness problem.
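The leading-bit patterns can be read off with a few masks. The sketch below uses two hypothetical helpers, `utf8_sequence_length` and `is_continuation`; it only classifies bytes by their top bits and does not validate the input (for example, it does not reject overlong forms or lead bytes above 0xF4).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte occupies."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx
        return 1
    if (first_byte & 0b1110_0000) == 0b1100_0000:  # 110xxxxx
        return 2
    if (first_byte & 0b1111_0000) == 0b1110_0000:  # 1110xxxx
        return 3
    if (first_byte & 0b1111_1000) == 0b1111_0000:  # 11110xxx
        return 4
    raise ValueError("not a lead byte (continuation byte or invalid pattern)")

def is_continuation(byte: int) -> bool:
    """Continuation bytes always look like 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000

# U+0061, U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 bytes respectively.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    assert all(is_continuation(b) for b in encoded[1:])
    print(f"U+{ord(ch):04X}", encoded.hex(" "))
```

Because every byte identifies its own role, the byte stream reads the same on any architecture, which is why UTF-8 does not need a BOM to convey byte order.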
