Java - what are characters, code points and surrogates? What difference is there between them?

Views: 141 Published: 2016/11/18 16:12:37 Tags: java character-encoding character

Question

I'm trying to find an explanation of the terms "character", "code point" and "surrogate", and while these terms aren't limited to Java, if there are any language-specific differences I'd like the explanation as it relates to Java.

I've found some information about the differences between characters and code points, characters being what is displayed for human users, and code points being a value encoding that specific character, but I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?

In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand, and rather than create a long series of comments on a 5-year-old question I thought it would be best to ask for clarification in a new question.

Recommended answer

To represent text in computers, you have to solve two things: first, you have to map symbols to numbers, then, you have to represent a sequence of those numbers with bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols (the 7-bit values 0 to 127). Unicode defined 109384 symbols at the time of writing, far more than the 2^16 = 65536 values that fit into 16 bits.
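As a rough illustration of "a code point is a number", the following sketch asks Java for the code points of three symbols. The class name CodePointSketch and its helper method are made up for this example; String.codePointAt, Character.toChars and Character.isSupplementaryCodePoint are standard JDK methods.

```java
class CodePointSketch {
    // Hypothetical helper: returns the Unicode code point that starts at the given index.
    static int codePointAt(String text, int index) {
        return text.codePointAt(index);
    }

    public static void main(String[] args) {
        // 'A' has the same number in ASCII and in Unicode.
        System.out.println(codePointAt("A", 0)); // 65

        // U+20AC (euro sign) is outside ASCII but still below 2^16.
        System.out.printf("U+%04X%n", codePointAt("\u20AC", 0)); // U+20AC

        // U+1F600 (an emoji) lies above 2^16, so it does not fit into 16 bits.
        String face = new String(Character.toChars(0x1F600));
        System.out.printf("U+%04X%n", codePointAt(face, 0)); // U+1F600
        System.out.println(Character.isSupplementaryCodePoint(0x1F600)); // true
    }
}
```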

Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
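To make the difference between these encodings concrete, here is a small sketch that encodes the same symbol in several ways and counts the resulting bytes. The class name EncodingSizeSketch and the helper encodedLength are invented for this example. It also assumes that the running JDK provides a charset named "UTF-32BE", which common OpenJDK builds do but which is not among the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizeSketch {
    // Hypothetical helper: number of bytes needed to encode the text in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String letter = "A";
        String euro = "\u20AC"; // U+20AC, euro sign

        // ASCII: one byte per number.
        System.out.println(encodedLength(letter, StandardCharsets.US_ASCII)); // 1

        // UTF-8: variable length, one byte for 'A' but three for the euro sign.
        System.out.println(encodedLength(letter, StandardCharsets.UTF_8)); // 1
        System.out.println(encodedLength(euro, StandardCharsets.UTF_8));   // 3

        // UTF-16 (big endian, no byte order mark): one 16-bit unit here.
        System.out.println(encodedLength(euro, StandardCharsets.UTF_16BE)); // 2

        // UTF-32: always four bytes per code point.
        // Assumption: the "UTF-32BE" charset is available in this JDK.
        System.out.println(encodedLength(euro, Charset.forName("UTF-32BE"))); // 4
    }
}
```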

When you try to use an encoding whose units have fewer bits than are needed to represent all possible values (such as UTF-16, which uses 16-bit units), you need some workaround.

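Thus, surrogates are 16-bit values that are used in pairs to represent symbols that do not fit into a single two-byte value. A high surrogate (0xD800 to 0xDBFF) is followed by a low surrogate (0xDC00 to 0xDFFF), and together they encode one code point above 0xFFFF.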
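The arithmetic behind a surrogate pair can be sketched in a few lines. The class SurrogatePairSketch and the method toSurrogatePair are hypothetical names for this example. In real code you would normally call Character.toChars or Character.highSurrogate and Character.lowSurrogate, which the sketch uses only to cross-check its own result.

```java
class SurrogatePairSketch {
    // Hypothetical helper: splits a code point above 0xFFFF into its UTF-16 surrogate pair.
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("fits into one 16-bit value, no surrogates needed");
        }
        int offset = codePoint - 0x10000;               // 20 bits remain after subtracting
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits go into the high surrogate
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits go into the low surrogate
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE00

        // Cross-check against the JDK's own conversion.
        System.out.println(pair[0] == Character.highSurrogate(0x1F600)); // true
        System.out.println(pair[1] == Character.lowSurrogate(0x1F600));  // true

        // Going back: the pair combines into the original code point.
        System.out.printf("U+%04X%n", Character.toCodePoint(pair[0], pair[1])); // U+1F600
    }
}
```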

Java uses UTF-16.

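In particular, a char (character) is an unsigned two-byte value that contains one UTF-16 code unit. A symbol outside the 16-bit range therefore takes up two chars in a String, which is why stepping through a string char by char can give you surrogates instead of whole code points.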
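For the situation from the question, stepping through a string, this sketch compares the two views of the same text. The class CharVersusCodePointSketch and its two counting helpers are made-up names. String.length, String.codePointCount and String.codePoints (Java 8 or later) are standard JDK methods.

```java
class CharVersusCodePointSketch {
    // Hypothetical helper: number of 16-bit char values (UTF-16 code units).
    static int countChars(String text) {
        return text.length();
    }

    // Hypothetical helper: number of Unicode code points.
    static int countCodePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // "a", then U+1F600 (stored as two surrogates), then "b".
        String text = "a" + new String(Character.toChars(0x1F600)) + "b";

        System.out.println(countChars(text));      // 4
        System.out.println(countCodePoints(text)); // 3

        // Char by char: the surrogates D83D and DE00 show up as separate values.
        for (int i = 0; i < text.length(); i++) {
            System.out.printf("%04X ", (int) text.charAt(i));
        }
        System.out.println();

        // Code point by code point: the pair is combined into U+1F600.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

If you need whole symbols rather than 16-bit units, iterating over codePoints() is usually the simpler choice, because it combines each surrogate pair for you.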

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2
