Java - what are characters, code points and surrogates? What difference is there between them?
Question
I'm trying to find an explanation of the terms "character", "code point" and "surrogate", and while these terms aren't limited to Java, if there are any language-specific differences I'd like the explanation as it relates to Java.
I've found some information about the differences between characters and code points, characters being what is displayed for human users, and code points being a value encoding that specific character, but I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?
In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand, and rather than create a long series of comments on a 5-year-old question I thought it would be best to ask for clarification in a new question.
Recommended answer
To represent text in computers, you have to solve two things: first, you have to map symbols to numbers, then, you have to represent a sequence of those numbers with bytes.
A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode defines well over 100,000 symbols (109,384 at the time of writing), which is far more than the 2^16 = 65,536 values that fit in 16 bits.
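As a rough illustration of "a code point is just a number", the sketch below asks Java for the code point of a few symbols. The class and method names (CodePointBasics, firstCodePoint) are made up for this example; codePointAt, Character.toChars and Character.MAX_CODE_POINT are standard JDK members.

```java
class CodePointBasics {
    // Returns the code point (the identifying number) of the first symbol in the text.
    static int firstCodePoint(String text) {
        return text.codePointAt(0);
    }

    public static void main(String[] args) {
        // 'A' has the same number, 65, in ASCII and in Unicode.
        System.out.println(firstCodePoint("A"));

        // The euro sign U+20AC is outside ASCII but still below 2^16.
        System.out.println(Integer.toHexString(firstCodePoint("\u20AC")));

        // U+1F600 is above 2^16, so it cannot be written as a single \\u escape;
        // Character.toChars builds the text for it from the number.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(Integer.toHexString(firstCodePoint(emoji)));

        // The largest code point Unicode allows is 0x10FFFF.
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT));
    }
}
```

The last two lines are the interesting ones: they show numbers that do not fit into 16 bits, which is where the rest of the answer comes in.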
Furthermore, ASCII specifies that a sequence of numbers is represented with one byte per number, while Unicode specifies several possible encodings, such as UTF-8, UTF-16, and UTF-32.
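To make the difference between the encodings concrete, here is a small sketch that counts how many bytes the same symbol needs in each of them. EncodingSizes and byteCount are hypothetical names for this example. UTF-8 and UTF-16BE come from StandardCharsets; UTF-32BE is looked up by name, and the sketch assumes the runtime provides it (common JDKs do, but the Java specification does not require it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes needed to store the text in the given encoding.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: the runtime ships a UTF-32BE charset.
        Charset utf32 = Charset.forName("UTF-32BE");

        String letter = "A";                                     // U+0041
        String euro = "\u20AC";                                  // U+20AC
        String emoji = new String(Character.toChars(0x1F600));   // U+1F600

        for (String text : new String[] { letter, euro, emoji }) {
            System.out.println("U+" + Integer.toHexString(text.codePointAt(0))
                    + ": UTF-8=" + byteCount(text, StandardCharsets.UTF_8)
                    + " UTF-16=" + byteCount(text, StandardCharsets.UTF_16BE)
                    + " UTF-32=" + byteCount(text, utf32));
        }
    }
}
```

The big-endian variants are used here because they do not prepend a byte order mark, which would otherwise inflate the counts.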
When you try to use an encoding that uses fewer bits per character than are needed to represent all possible values (such as UTF-16, which uses 16 bits), you need some workaround.
Thus, surrogates are 16-bit values that are used in pairs to represent symbols that do not fit into a single two-byte value. They come from a reserved range (U+D800 to U+DFFF), so they cannot be mistaken for ordinary symbols.
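The workaround can be sketched by hand. The code below splits a code point above U+FFFF into the two 16-bit values UTF-16 uses for it, following the standard UTF-16 arithmetic. SurrogateMath and toSurrogatePair are names invented for this example; in real code Character.toChars already does this for you.

```java
import java.util.Arrays;

class SurrogateMath {
    // Splits a code point above U+FFFF into a high and a low surrogate.
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F600);
        System.out.println(Integer.toHexString(pair[0]) + " " + Integer.toHexString(pair[1]));

        // The JDK's own conversion should produce the same two values.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F600)));

        // And the pair can be combined back into the original code point.
        System.out.println(Integer.toHexString(Character.toCodePoint(pair[0], pair[1])));
    }
}
```

For U+1F600 this yields the pair D83D, DE00. Neither value means anything on its own; only the two together identify the symbol.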
Java uses UTF-16.
In particular, a char (character) is an unsigned two-byte value that contains a UTF-16 value, so a single char may be only one half of a surrogate pair rather than a complete symbol.
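This is what the comment quoted in the question is getting at. A minimal sketch, with the hypothetical names CharsVersusCodePoints and countSurrogateChars, comparing a walk over char values with a walk over code points:

```java
class CharsVersusCodePoints {
    // Counts the char values in the text that are surrogates, not whole symbols.
    static int countSurrogateChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isSurrogate(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three symbols: 'a', U+1F600 (written as its surrogate pair), 'b'.
        String text = "a\uD83D\uDE00b";

        // length() counts char values, so the emoji is counted twice.
        System.out.println(text.length());

        // codePointCount() counts symbols.
        System.out.println(text.codePointCount(0, text.length()));

        // Stepping through by char exposes the two surrogates.
        System.out.println(countSurrogateChars(text));

        // Stepping through by code point yields one number per symbol.
        text.codePoints().forEach(cp ->
                System.out.println("U+" + Integer.toHexString(cp).toUpperCase()));
    }
}
```

If the text may contain symbols outside the first 2^16 code points, iterating with codePoints() (or codePointAt plus Character.charCount) is the safer choice than charAt.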
If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2