Java - what are characters, code points and surrogates? What difference is there between them?

Question

I'm trying to find an explanation of the terms "character", "code point" and "surrogate", and while these terms aren't limited to Java, if there are any language-specific differences I'd like the explanation as it relates to Java.

I've found some information about the differences between characters and code points, characters being what is displayed for human users, and code points being a value encoding that specific character, but I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?

In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand, and rather than create a long series of comments on a 5-year-old question I thought it would be best to ask for clarification in a new question.

Recommended answer

To represent text in computers, you have to solve two things: first, you have to map symbols to numbers, then, you have to represent a sequence of those numbers with bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols (the 256-symbol sets are 8-bit extensions such as ISO-8859-1). Unicode defined 109,384 symbols at the time of writing, which is far more than 2^16 (65,536).
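As a minimal sketch of the "symbol to number" mapping in Java: `String.codePointAt` returns the code point as a plain `int`. The class and method names (`CodePointDemo`, `firstCodePoint`) are made up for illustration, and the euro sign is written as an escape so the example does not depend on the source file encoding.

```java
class CodePointDemo {
    // Returns the code point (a plain number) of the first symbol in s.
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // 'A' is code point 65 (U+0041) in both ASCII and Unicode.
        System.out.println(firstCodePoint("A")); // 65

        // The euro sign is U+20AC, which has no ASCII equivalent.
        System.out.println(Integer.toHexString(firstCodePoint("\u20AC"))); // 20ac

        // The largest Unicode code point does not fit into 16 bits.
        System.out.println(Character.MAX_CODE_POINT > 0xFFFF); // true
    }
}
```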

Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
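To illustrate that the same code points turn into different byte sequences depending on the encoding, here is a small sketch that compares encoded lengths. The class and helper names (`EncodingDemo`, `utf8Length`, `utf16Length`) are hypothetical. `UTF_16BE` is used instead of `UTF_16` so that no byte-order mark is added to the output.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Number of bytes needed to store s in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store s in big-endian UTF-16 (no byte-order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "A";      // U+0041
        String euro = "\u20AC";  // U+20AC, the euro sign

        System.out.println(utf8Length(ascii));  // 1
        System.out.println(utf16Length(ascii)); // 2
        System.out.println(utf8Length(euro));   // 3
        System.out.println(utf16Length(euro));  // 2
    }
}
```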

When you use an encoding whose units have fewer bits than are needed to represent every possible value (such as UTF-16, which uses 16-bit units), you need a workaround.

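Thus, surrogates are 16-bit values that signal a symbol which does not fit into a single two-byte value. UTF-16 reserves the range U+D800 to U+DFFF for them, and a high surrogate followed by a low surrogate together encode one code point above U+FFFF.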
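The following sketch shows how one code point above U+FFFF is split into a surrogate pair, using U+1F600 as an arbitrary example. The helpers `highSurrogate` and `lowSurrogate` are hypothetical and spell out the UTF-16 arithmetic by hand, assuming a supplementary code point as input; in real code `Character.toChars` does the same job.

```java
class SurrogateDemo {
    // High (leading) surrogate: 0xD800 plus the top 10 bits of (codePoint - 0x10000).
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate: 0xDC00 plus the bottom 10 bits of (codePoint - 0x10000).
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // does not fit into 16 bits

        // The JDK performs the split for us.
        char[] units = Character.toChars(codePoint);
        System.out.println(units.length);                    // 2
        System.out.println(Integer.toHexString(units[0]));   // d83d
        System.out.println(Integer.toHexString(units[1]));   // de00

        // The hand-written arithmetic agrees with the JDK.
        System.out.println(units[0] == highSurrogate(codePoint)); // true
        System.out.println(units[1] == lowSurrogate(codePoint));  // true

        // Each half is recognisable on its own, and the pair can be recombined.
        System.out.println(Character.isHighSurrogate(units[0]));  // true
        System.out.println(Character.isLowSurrogate(units[1]));   // true
        System.out.println(Character.toCodePoint(units[0], units[1]) == codePoint); // true
    }
}
```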

Java uses UTF-16.

In particular, a char (character) is an unsigned two-byte value that holds one UTF-16 code unit, so a char can be half of a surrogate pair rather than a complete symbol.
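This is what the comment quoted in the question refers to: stepping through a string char by char can hand you surrogates, while stepping through code points cannot. A minimal sketch, assuming Java 8 or later for `String.codePoints()`; the class and helper names (`CharVsCodePointDemo`, `sample`, `countSurrogateChars`) are made up for the example.

```java
class CharVsCodePointDemo {
    // "a", then U+1F600 (stored as a surrogate pair), then "b".
    static String sample() {
        return "a" + new String(Character.toChars(0x1F600)) + "b";
    }

    // Counts the chars in s that are surrogates rather than complete symbols.
    static int countSurrogateChars(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String s = sample();

        // length() counts chars (UTF-16 code units), not symbols.
        System.out.println(s.length());                      // 4
        System.out.println(s.codePointCount(0, s.length())); // 3
        System.out.println(countSurrogateChars(s));          // 2

        // Iterating by code point yields each symbol exactly once.
        s.codePoints().forEach(cp -> System.out.println(String.format("U+%04X", cp)));
        // U+0061
        // U+1F600
        // U+0062
    }
}
```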

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2
