What is the Best UTF

Views: 30 | Published: 2021/9/15 19:41:20 | Tags: unicode, utf-8, utf


Problem description

I'm really confused about UTF in Unicode.

There are UTF-8, UTF-16 and UTF-32.

My questions are:

Which UTFs support all Unicode blocks?
What is the best UTF (performance, size, etc.), and why?
What is the difference between these three UTFs?
What are endianness and byte order marks (BOM)?

Thanks

Solution

Which UTFs support all Unicode blocks?

All UTF encodings support all Unicode blocks: every UTF encoding can represent every Unicode codepoint. However, some older, non-UTF encodings cannot. UCS-2, for example, is like UTF-16 but lacks surrogate pairs, so it cannot encode codepoints above 65535 (U+FFFF).
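
As a quick check of this claim, the sketch below pushes a few sample codepoints from different ranges through Python's built-in `utf-8`, `utf-16` and `utf-32` codecs and confirms that each one survives an encode/decode round trip. The sample list and the `round_trips` helper are illustrative choices, not part of the original answer.

```python
# Sample codepoints: ASCII, Latin-1 range, CJK (BMP), emoji (plane 1),
# and the last codepoint in Unicode.
SAMPLES = [0x41, 0xE9, 0x4E2D, 0x1F600, 0x10FFFF]
ENCODINGS = ("utf-8", "utf-16", "utf-32")


def round_trips(code_point: int, encoding: str) -> bool:
    """Return True if the codepoint survives encoding followed by decoding."""
    text = chr(code_point)
    return text.encode(encoding).decode(encoding) == text


for cp in SAMPLES:
    results = {enc: round_trips(cp, enc) for enc in ENCODINGS}
    print(f"U+{cp:04X}", results)
```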

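What is the best UTF (performance, size, etc.), and why?

For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient, at one byte per character. However, UTF-8 is less space-efficient than UTF-16 when most of the codepoints used lie between U+0800 and U+FFFF (such as large bodies of CJK text): UTF-8 needs three bytes for each of them, while UTF-16 needs two. UTF-8 is never larger than UTF-32, which always uses four bytes per codepoint.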
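
The size trade-off is easy to measure. The following sketch compares the encoded length of a short English string and a short Chinese string; the `encoded_sizes` helper and both sample strings are made up for illustration. The `-le` codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in each UTF, without a BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


english = "Hello, world"   # 12 ASCII characters
chinese = "你好，世界"       # 5 BMP characters, all above U+0800

print(encoded_sizes(english))  # {'utf-8': 12, 'utf-16': 24, 'utf-32': 48}
print(encoded_sizes(chinese))  # {'utf-8': 15, 'utf-16': 10, 'utf-32': 20}
```

For the ASCII string UTF-8 is half the size of UTF-16, while for the Chinese string UTF-16 is the smallest of the three.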

What is the difference between these three UTFs?

UTF-8 encodes each Unicode codepoint in one to four bytes. The Unicode values 0 to 127, which are the same as they are in ASCII, are encoded exactly as they are in ASCII, in a single byte. Bytes with values of 128 and above appear only in multi-byte sequences.

UTF-16 encodes each Unicode codepoint in either two bytes (one 16-bit UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) is encoded with one UTF-16 value. Codepoints from higher planes use two UTF-16 values, through a technique called 'surrogate pairs'.

UTF-32 is a fixed-length encoding, not a variable-length one: every Unicode codepoint value is stored as-is in four bytes. This means that U+10FFFF is encoded as 0x0010FFFF.
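
To make the three layouts concrete, here is a minimal sketch that prints the bytes of a few codepoints in each encoding and computes a surrogate pair by hand. The `surrogate_pair` function is a hypothetical helper written for this example; it follows the standard UTF-16 arithmetic (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves). `bytes.hex` with a separator argument assumes Python 3.8 or later.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a codepoint above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only codepoints above the BMP use a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# Columns: codepoint | UTF-8 | UTF-16 (big-endian) | UTF-32 (big-endian)
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    ch = chr(cp)
    print(
        f"U+{cp:04X}",
        "|", ch.encode("utf-8").hex(" "),
        "|", ch.encode("utf-16-be").hex(" "),
        "|", ch.encode("utf-32-be").hex(" "),
    )

high, low = surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")  # U+1F600 -> D83D DE00
```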

What are endianness and byte order marks (BOM)?

Endianness is the order in which a piece of data, a particular CPU architecture or a protocol stores the bytes of a multi-byte value. Little-endian systems (such as x86-32 and x86-64 CPUs) put the least-significant byte first, and big-endian systems (such as many networking protocols and PowerPC in its traditional configuration) put the most-significant byte first. ARM and PowerPC can operate in either byte order; ARM is most commonly run little-endian.

In a little-endian encoding or system, the 32-bit value 0x12345678 is stored or transmitted as 0x78 0x56 0x34 0x12. In a big-endian encoding or system, it is stored or transmitted as 0x12 0x34 0x56 0x78.
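
The same 0x12345678 example can be reproduced with Python's `int.to_bytes`, as a small sketch (the variable names are arbitrary, and `bytes.hex` with a separator assumes Python 3.8 or later):

```python
import sys

value = 0x12345678

little = value.to_bytes(4, byteorder="little")
big = value.to_bytes(4, byteorder="big")

print(little.hex(" "))  # 78 56 34 12
print(big.hex(" "))     # 12 34 56 78

# Byte order of the machine running this script: 'little' or 'big'.
print(sys.byteorder)
```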

A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way: U+FEFF is a valid codepoint, used for the byte order mark, while its byte-swapped form U+FFFE is a noncharacter that should never appear in text. Therefore, if a UTF-16 file starts with 0xFF 0xFE, it can be assumed that the rest of the file is stored in little-endian byte order; if it starts with 0xFE 0xFF, it is big-endian.

A byte order mark in UTF-8 is technically possible, but it is meaningless in the context of endianness because UTF-8 is defined as a sequence of single bytes, so there is no byte order to swap. However, a stream that begins with the UTF-8 encoded BOM (0xEF 0xBB 0xBF) almost certainly is UTF-8, so the BOM can be used for identification.
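
BOM-based identification could look roughly like the sketch below. It uses the BOM constants from Python's standard `codecs` module; the `sniff_bom` function and the `_BOMS` table are hypothetical names for this example. One assumption to note: a stream starting with FF FE 00 00 is treated as UTF-32 little-endian, although it could in principle be UTF-16 little-endian text that begins with U+0000.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM (FF FE 00 00)
# starts with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
print(encoding, raw[skip:].decode(encoding))  # utf-16-le hi
```

The absence of a BOM says nothing about the encoding, so a caller still needs a fallback for the `(None, 0)` case.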

Benefits of UTF-8

ASCII is a subset of the UTF-8 encoding, so UTF-8 is a great way to introduce ASCII text into a 'Unicode world' without having to do data conversion
Of the three, UTF-8 is the most compact format for ASCII text
Valid UTF-8 can be sorted on byte values and the result is sorted by codepoint
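
The first and last points above can be checked with a short sketch. The word list is an arbitrary sample chosen to include a supplementary-plane character and a high BMP character; the UTF-16 comparison is added only as a contrast and is not a claim made in the original answer. Python compares `str` values codepoint by codepoint, which gives the reference order.

```python
# ASCII text has identical bytes in ASCII and in UTF-8.
print("plain text".encode("ascii") == "plain text".encode("utf-8"))  # True

words = ["zebra", "é", "中", "\U0001F600", "apple", "\uFF21"]

by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)   # True
# Surrogates (D800 to DFFF) sort below U+FF21 as bytes, so the emoji
# moves ahead of it in UTF-16 byte order.
print(by_code_point == by_utf16_bytes)  # False
```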

Benefits of UTF-16

UTF-16 is easier than UTF-8 to decode, even though it is a variable-length encoding
UTF-16 is more space-efficient than UTF-8 for characters in the BMP from U+0800 upward (two bytes instead of three); from U+0080 to U+07FF both need two bytes

Benefits of UTF-32

UTF-32 is not variable-length, so it requires no special logic to decode
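
Because every codepoint occupies exactly four bytes, the n-th codepoint of a UTF-32 buffer sits at byte offset 4 * n. The sketch below reads one directly with `struct.unpack_from`; the `code_point_at` helper is a hypothetical name, and the buffer is assumed to be little-endian with no BOM.

```python
import struct


def code_point_at(utf32_le: bytes, index: int) -> int:
    """Read the codepoint at position index from BOM-less UTF-32 LE data."""
    (code_point,) = struct.unpack_from("<I", utf32_le, index * 4)
    return code_point


data = "a中\U0001F600".encode("utf-32-le")

print(len(data))                      # 12
print(hex(code_point_at(data, 2)))    # 0x1f600
```

With UTF-8 or UTF-16 the same lookup requires scanning from the start, since earlier codepoints may occupy different numbers of bytes.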
