Isn't UTF-8's byte order on big-endian machines different from its byte order on little-endian machines? So why doesn't UTF-8 require a BOM?


Question

UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order.

If UTF-8 stored every code point in a single byte, it would make sense that endianness plays no role and that a BOM isn't required. But code points 128 and above are stored using 2, 3 and up to 6 bytes, which would seem to mean that their byte order on big-endian machines differs from their byte order on little-endian machines. So how can we claim that UTF-8 always has the same byte order?

Thank you

UTF-8 is byte oriented

I understand that if a two-byte UTF-8 character C consists of bytes B1 and B2 (where B1 is the first byte and B2 is the last byte), then with UTF-8 those two bytes are always written in the same order. Thus, if this character is written to a file on a little-endian machine LEM, B1 will be first and B2 last. Similarly, if C is written to a file on a big-endian machine BEM, B1 will still be first and B2 still last.

But what happens when C is written to file F on LEM, and we then copy F to BEM and try to read it there? Since BEM automatically swaps bytes (B1 is now the last byte and B2 the first), how will an app running on BEM and reading F know whether F was created on BEM, so that the order of the two bytes wasn't swapped, or whether F was transferred from LEM, in which case BEM automatically swapped the bytes?

I hope the question makes sense.

Edit 2:

In response to your edit: big-endian machines do not swap bytes if you ask them to read a byte at a time.

a) Oh, so even though character C is 2 bytes long, an app (residing on BEM) reading F will read into memory just one byte at a time (thus it will first read B1 into memory and only then B2).

b)

In UTF-8, you decide what to do with a byte based on its high-order bits

Assume file F has two consecutive characters C and C1 (where C consists of bytes B1 and B2, while C1 has bytes B3, B4 and B5). How will an app reading F know which bytes belong together simply by checking each byte's high-order bits (for example, how will it figure out that B1 and B2 taken together should represent a character, and not B1, B2 and B3)?

If you believe that you're seeing something different, please edit your question and include

I'm not saying that. I simply didn't understand what was going on.

c) Why aren't UTF-16 and UTF-32 also byte oriented?

Recommended answer

The byte order differs between big-endian and little-endian machines for words/integers larger than a byte.

For example, on a big-endian machine a short integer of 2 bytes stores the 8 most significant bits in the first byte and the 8 least significant bits in the second byte. On a little-endian machine the 8 most significant bits are in the second byte and the 8 least significant bits in the first byte.

So, if you write the memory content of such a short int directly to a file/network, the byte ordering within the short int will differ depending on the endianness.
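
A minimal sketch of this in Python, using the standard `struct` module to lay out the same 16-bit value in both byte orders. The value 0x1234 is an arbitrary choice for illustration, not something from the question.

```python
import struct
import sys

value = 0x1234  # arbitrary 16-bit example value (an assumption for illustration)

big = struct.pack(">H", value)     # big-endian layout: most significant byte first
little = struct.pack("<H", value)  # little-endian layout: least significant byte first
native = struct.pack("=H", value)  # the layout this machine uses natively

print(big.hex())     # 1234
print(little.hex())  # 3412
print(sys.byteorder, native.hex())  # depends on the machine running the code
```

The number is the same in both cases; only the order of its two bytes in memory (and therefore in a raw dump to a file) differs.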

UTF-8 is byte oriented, so there is no issue regarding endianness. The first byte is always the first byte, the second byte is always the second byte, and so on, regardless of endianness.
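
To make this concrete, here is a small Python sketch. The sample text "é€" is an assumption chosen to mirror the question: a 2-byte character followed by a 3-byte character. `sequence_length` and `split_characters` are hypothetical helper names, not library functions, and they assume well-formed input without validating continuation bytes. The sketch suggests three things: the UTF-8 bytes form one fixed sequence, the high-order bits of each lead byte announce how many bytes belong together, and UTF-16 by contrast is built from 16-bit units whose two bytes can be laid out either way.

```python
text = "é€"  # U+00E9 (2 bytes in UTF-8) followed by U+20AC (3 bytes in UTF-8)

utf8 = text.encode("utf-8")
print(utf8.hex(" "))  # c3 a9 e2 82 ac, regardless of the machine's endianness


def sequence_length(lead):
    """Return the sequence length announced by a UTF-8 lead byte."""
    if lead & 0b1000_0000 == 0b0000_0000:
        return 1  # 0xxxxxxx: single-byte (ASCII) character
    if lead & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx: lead byte of a 2-byte sequence
    if lead & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx: lead byte of a 3-byte sequence
    if lead & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx: lead byte of a 4-byte sequence
    # 10xxxxxx is a continuation byte and can never start a character
    raise ValueError(f"not a lead byte: {lead:#04x}")


def split_characters(data):
    """Group a well-formed UTF-8 byte string into per-character chunks."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_characters(utf8))  # [b'\xc3\xa9', b'\xe2\x82\xac']

# UTF-16 stores 16-bit code units, so the two bytes of each unit have an order.
print(text.encode("utf-16-be").hex(" "))  # 00 e9 20 ac
print(text.encode("utf-16-le").hex(" "))  # e9 00 ac 20
```

Because UTF-16 and UTF-32 are defined in terms of 16-bit and 32-bit units rather than single bytes, a reader needs to know which layout was used; this is what a BOM signals for those encodings.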
