What is the Best UTF?


Question

I'm really confused about UTF in Unicode.

There are UTF-8, UTF-16 and UTF-32.

My questions are:

  1. Which UTFs support all Unicode blocks?

  2. What is the best UTF (performance, size, etc.), and why?

  3. What is the difference between these three UTFs?

  4. What are endianness and byte order marks (BOM)?

Thanks

Solution

Which UTFs support all Unicode blocks?

All UTF encodings support all Unicode blocks: every UTF can represent every Unicode codepoint. However, some older, non-UTF encodings cannot. UCS-2, for example, is like UTF-16 but lacks surrogate pairs, and thus cannot encode codepoints above 65535/U+FFFF.

What is the best UTF (performance, size, etc.), and why?

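For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient: one byte per character, against two in UTF-16 and four in UTF-32. However, UTF-8 is less space-efficient than UTF-16 where most of the codepoints used fall between U+0800 and U+FFFF (such as large bodies of CJK text), because those take three bytes in UTF-8 and two in UTF-16. UTF-8 is never larger than UTF-32, since no codepoint needs more than four bytes.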
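To make the size trade-off concrete, here is a small Python sketch that counts the bytes each encoding needs, without a BOM. The helper name `encoded_sizes` and the two sample strings are made up for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed to store `text` in each UTF (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


english = "Hello, world"  # 12 ASCII characters
chinese = "你好世界"        # 4 CJK characters, all in the BMP

print(encoded_sizes(english))  # {'utf-8': 12, 'utf-16': 24, 'utf-32': 48}
print(encoded_sizes(chinese))  # {'utf-8': 12, 'utf-16': 8, 'utf-32': 16}
```

The explicit `-le` codec names are used so that Python does not prepend a BOM, which would otherwise add 2 or 4 bytes to the UTF-16 and UTF-32 counts.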

What is the difference between these three UTFs?

UTF-8 encodes each Unicode codepoint in one to four bytes. The Unicode values 0 to 127, which are the same as they are in ASCII, are encoded as the same single bytes as in ASCII. Bytes with values 128 to 255 appear only as part of multi-byte sequences, which encode all other codepoints.
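If it helps, the one-to-four-byte layout can be written out by hand. This is only a sketch for illustration (the function name `utf8_encode` is mine, and in real code you would just call `str.encode("utf-8")`):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode codepoint as UTF-8, spelling out the bit layout."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if cp < 0x80:        # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x41).hex(" "))     # 41
print(utf8_encode(0x20AC).hex(" "))   # e2 82 ac
```

Every byte of a multi-byte sequence has its high bit set, which is why bytes 0 to 127 always mean exactly what they mean in ASCII.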

UTF-16 encodes each Unicode codepoint in either two bytes (one 16-bit UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) is encoded with one UTF-16 value. Codepoints from higher planes use two UTF-16 values, through a technique called 'surrogate pairs'.
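The surrogate pair arithmetic is short enough to sketch. The function name `to_surrogate_pair` is just for illustration; Python's own `utf-16` codecs do this for you.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a codepoint above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only codepoints above the BMP need a surrogate pair")
    offset = cp - 0x10000               # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits    -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> U+DC00..U+DFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

The ranges U+D800 to U+DFFF are reserved for this purpose, so a decoder can always tell a surrogate from an ordinary BMP value.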

UTF-32 is a fixed-width encoding, not a variable-length one: every Unicode codepoint value is stored as-is in a single 32-bit value. This means that U+10FFFF is encoded as 0x0010FFFF.

What are endianness and byte order marks (BOM)?

Endianness is the order in which a piece of data, a particular CPU architecture or a protocol stores the bytes of a multi-byte value. Little-endian systems (such as x86-32 and x86-64 CPUs) put the least-significant byte first, and big-endian systems (such as many networking protocols, and PowerPC in its traditional configuration) put the most-significant byte first. ARM can run in either mode but is almost always configured as little-endian.

In a little-endian encoding or system, the 32-bit value 0x12345678 is stored or transmitted as 0x78 0x56 0x34 0x12. In a big-endian encoding or system, it is stored or transmitted as 0x12 0x34 0x56 0x78.
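You can check the byte orders above directly in Python. The snippet also shows how the same idea applies to the UTF-32 encoding of U+10FFFF mentioned earlier; the variable names are arbitrary.

```python
value = 0x12345678

little = value.to_bytes(4, "little")
big = value.to_bytes(4, "big")
print(little.hex(" "))  # 78 56 34 12
print(big.hex(" "))     # 12 34 56 78

# The same codepoint, U+10FFFF, in both UTF-32 byte orders (no BOM):
last = chr(0x10FFFF)
print(last.encode("utf-32-be").hex(" "))  # 00 10 ff ff
print(last.encode("utf-32-le").hex(" "))  # ff ff 10 00
```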

A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way: U+FEFF is a valid codepoint, used for the byte order mark, while its byte-swapped form U+FFFE is a noncharacter that should never appear in text. Therefore, if a UTF-16 file starts with 0xFF 0xFE, it can be assumed that the rest of the file is stored in little-endian byte order; if it starts with 0xFE 0xFF, it is big-endian.

A byte order mark in UTF-8 is technically possible (U+FEFF encodes to the bytes 0xEF 0xBB 0xBF), but it says nothing about endianness, because UTF-8 is a sequence of single bytes with no byte order to swap. However, a stream that begins with the UTF-8 encoded BOM is almost certainly UTF-8, so the BOM can still be used to identify the encoding.
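As a rough sketch of BOM-based identification, the following uses the BOM constants from Python's standard `codecs` module. The function name `sniff_bom` and the table `BOMS` are my own. Note that the UTF-32 little-endian BOM (0xFF 0xFE 0x00 0x00) starts with the same two bytes as the UTF-16 little-endian BOM, so the longer marks have to be tested first.

```python
import codecs

# Longest BOMs first, so UTF-32-LE is not mistaken for UTF-16-LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> tuple:
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
print(encoding, raw[skip:].decode(encoding))  # utf-16-le hi
```

A missing BOM proves nothing, so this is only a hint: data without one still needs an encoding declared elsewhere or guessed by other means.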

Benefits of UTF-8

  • ASCII is a subset of the UTF-8 encoding, so UTF-8 is a great way to introduce ASCII text into a 'Unicode world' without having to do data conversion
  • UTF-8 is the most compact of the three formats for ASCII text
  • Sorting valid UTF-8 strings by byte value gives the same order as sorting them by codepoint
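The sorting property in the last bullet is easy to check, and it is worth seeing that UTF-16 does not share it once surrogate pairs are involved. The sample strings below are arbitrary.

```python
# Python compares str values codepoint by codepoint, so sorted() gives codepoint order.
words = ["z", "é", "中", "\U0001F600", "a", "\uFF21"]

by_codepoint = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_utf8_bytes == by_codepoint)   # True
print(by_utf16_bytes == by_codepoint)  # False
```

The UTF-16 order differs because U+1F600 is stored as a surrogate pair starting with 0xD83D, which sorts before the BMP character U+FF21 even though its codepoint is higher.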

Benefits of UTF-16

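  • UTF-16 is easier than UTF-8 to decode, even though it is a variable-length encoding, because a codepoint is either one value or one surrogate pair
  • UTF-16 is more space-efficient than UTF-8 for characters from U+0800 to U+FFFF (two bytes instead of three); for U+0080 to U+07FF both use two bytes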

Benefits of UTF-32

  • UTF-32 is not variable-length, so it requires no special logic to decode
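One practical consequence of the fixed width is that the n-th codepoint always sits at byte offset 4n, so it can be read without scanning the text. A minimal sketch, assuming little-endian UTF-32 data without a BOM (the function name `codepoint_at` is hypothetical):

```python
import struct


def codepoint_at(utf32_le: bytes, index: int) -> int:
    """Read the codepoint at position `index` from BOM-less little-endian UTF-32."""
    (cp,) = struct.unpack_from("<I", utf32_le, index * 4)
    return cp


data = "a\U0001F600b".encode("utf-32-le")
print(hex(codepoint_at(data, 1)))  # 0x1f600
```

With UTF-8 or UTF-16 the same lookup needs a scan from the start, since earlier codepoints may occupy a varying number of bytes.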
