Isn't UTF-8's byte order on big-endian machines different from that on little-endian machines? So why doesn't UTF-8 require a BOM?
UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order.
If UTF-8 stored all code points in a single byte, then it would make sense why endianness doesn't play any role and thus why a BOM isn't required. But code points 128 and above are stored using 2, 3 and up to 6 bytes, which would seem to mean that their byte order on big-endian machines is different from that on little-endian machines. So how can we claim UTF-8 always has the same byte order?
Thank you
EDIT:
UTF-8 is byte oriented
I understand that if a two-byte UTF-8 character C consists of bytes B1 and B2 (where B1 is the first byte and B2 is the last byte), then with UTF-8 those two bytes are always written in the same order. Thus if this character is written to a file on a little-endian machine LEM, B1 will be first and B2 last. Similarly, if C is written to a file on a big-endian machine BEM, B1 will still be first and B2 still last.
But what happens when C is written to file F on LEM, but we copy F to BEM and try to read it there? Since BEM automatically swaps bytes (B1 is now the last byte and B2 the first), how will an app running on BEM and reading F know whether F was created on BEM, so that the order of the two bytes wasn't swapped, or whether F was transferred from LEM, in which case BEM automatically swapped the bytes?
I hope the question makes some sense.
EDIT 2:
In response to your edit: big-endian machines do not swap bytes if you ask them to read a byte at a time.
a) Oh, so even though character C is 2 bytes long, the app (residing on BEM) reading F will read into memory just one byte at a time (thus it will first read B1 into memory and only then B2)?
b)
In UTF-8, you decide what to do with a byte based on its high-order bits
Assume file F has two consecutive characters C and C1 (where C consists of bytes B1 and B2, while C1 consists of bytes B3, B4 and B5). How will an app reading F know which bytes belong together simply by checking each byte's high-order bits? For example, how will it figure out that B1 and B2 taken together should represent a character, and not B1, B2 and B3?
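For illustration, here is a minimal Python sketch of the grouping asked about in b). It assumes well-formed input and the current definition of UTF-8 (sequences of at most 4 bytes). The helper names `sequence_length` and `split_characters` are made up for this example. The high-order bits of the first byte of a character say how many bytes the character occupies, and every following byte of that character starts with the bits `10`.

```python
from typing import List


def sequence_length(lead: int) -> int:
    """Return how many bytes a character occupies, judged from its first byte."""
    if lead & 0b1000_0000 == 0b0000_0000:  # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead & 0b1110_0000 == 0b1100_0000:  # 110xxxxx: 2-byte sequence
        return 2
    if lead & 0b1111_0000 == 0b1110_0000:  # 1110xxxx: 3-byte sequence
        return 3
    if lead & 0b1111_1000 == 0b1111_0000:  # 11110xxx: 4-byte sequence
        return 4
    raise ValueError(f"byte {lead:#04x} cannot start a character")


def split_characters(data: bytes) -> List[bytes]:
    """Group a UTF-8 byte stream into per-character byte sequences."""
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the first must look like 10xxxxxx.
        if len(chunk) != n or any(b & 0b1100_0000 != 0b1000_0000 for b in chunk[1:]):
            raise ValueError("truncated or malformed sequence")
        chars.append(chunk)
        i += n
    return chars


# C is 2 bytes long (B1 B2) and C1 is 3 bytes long (B3 B4 B5).
data = "é€".encode("utf-8")
groups = split_characters(data)
print([g.hex() for g in groups])
```

Here B1 is `0xC3`, whose top bits `110` announce a two-byte character, so B3 (`0xE2`, top bits `1110`) can only be the start of the next character. The sketch does not reject overlong encodings or surrogates, which a real decoder would also check.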
If you believe that you're seeing something different, please edit your question and include
I’m not saying that. I simply didn’t understand what was going on
c) Why aren't UTF-16 and UTF-32 also byte oriented?
The byte order is different on big endian vs little endian machines for words/integers larger than a byte.
For example, on a big-endian machine a short integer of 2 bytes stores the 8 most significant bits in the first byte and the 8 least significant bits in the second byte. On a little-endian machine the 8 most significant bits will be in the second byte and the 8 least significant bits in the first byte.
So, if you write the memory content of such a short int directly to a file/network, the byte ordering within the short int will be different depending on the endianness.
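As a small sketch of that difference, Python's standard `struct` module can lay out the same 2-byte integer in either byte order. The value `0x1234` is just an arbitrary example.

```python
import struct
import sys

value = 0x1234  # an arbitrary 2-byte unsigned short

big = struct.pack(">H", value)     # big-endian: most significant byte first
little = struct.pack("<H", value)  # little-endian: least significant byte first
native = struct.pack("=H", value)  # byte order of the machine running this

print(big.hex())     # 1234
print(little.hex())  # 3412
print(sys.byteorder, native.hex())
```

Dumping `native` straight to a file therefore produces different bytes on the two kinds of machine, which is the situation described above.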
UTF-8 is byte oriented, so there is no issue regarding endianness. The first byte is always the first byte, the second byte is always the second byte, and so on, regardless of endianness.
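A short Python sketch, using the character `é` (U+00E9) as an arbitrary example, may make the contrast with UTF-16 concrete. UTF-8 is defined as a sequence of bytes, so there is only one possible output. UTF-16 is defined in terms of 16-bit units, and each unit can be serialized in either byte order, which is why it benefits from a BOM.

```python
import io

text = "é"  # U+00E9

utf8 = text.encode("utf-8")          # one fixed byte sequence
utf16_be = text.encode("utf-16-be")  # 16-bit unit 0x00E9, high byte first
utf16_le = text.encode("utf-16-le")  # same unit, low byte first

print(utf8.hex())      # c3a9
print(utf16_be.hex())  # 00e9
print(utf16_le.hex())  # e900

# Reading a byte stream one byte at a time preserves the order in the stream,
# whatever the endianness of the machine doing the reading.
stream = io.BytesIO(utf8)
read_back = [stream.read(1) for _ in range(len(utf8))]
print(read_back)
```

Endianness only matters when several bytes are loaded as a single multi-byte integer. A UTF-8 decoder never needs to do that, whereas a UTF-16 decoder has to know which of the two bytes in each unit is the high one.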