Isn't the size of a character in Java 2 bytes?
Question
I used RandomAccessFile to read a byte from a text file.
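import java.io.IOException;
import java.io.RandomAccessFile;

public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    // Read a single byte from the current file position.
    fr.read(cbuff, 0, 1);
    // Decode that byte using the platform's default charset.
    System.out.println(new String(cbuff));
}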
Why does this print one complete character, even though only a single byte was read?
Recommended answer
A char represents a character in Java (*). It is 2 bytes (16 bits) in size.
That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact, many character encodings use only 1 byte for every character, or use 1 byte for the most common characters.
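To make the difference between the in-memory size of a char and the encoded size of a character concrete, here is a minimal sketch. The class and method names (EncodedSizes, encodedLength) are made up for illustration; only the JDK calls are real.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    // Number of bytes needed to store the text in the given encoding.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Size of the in-memory char type, independent of any file encoding.
        System.out.println("char: " + Character.BYTES + " bytes");

        // 'A', 'e' with acute accent (U+00E9) and the euro sign (U+20AC).
        for (String text : new String[] {"A", "\u00E9", "\u20AC"}) {
            System.out.println("U+" + Integer.toHexString(text.codePointAt(0))
                    + " UTF-8: " + encodedLength(text, StandardCharsets.UTF_8)
                    + ", UTF-16BE: " + encodedLength(text, StandardCharsets.UTF_16BE));
        }
        System.out.println("U+e9 ISO-8859-1: "
                + encodedLength("\u00E9", StandardCharsets.ISO_8859_1));
    }
}
```

The same character takes 1, 2 or 3 bytes depending on the encoding chosen, while the char type itself stays at 2 bytes.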
When you call the String(byte[]) constructor, you ask Java to convert the byte[] to a String using the platform's default charset. Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8, it can easily convert that 1 byte to a single character.
If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result: you'll get a String containing the Unicode Replacement Character (U+FFFD) instead.
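The effect of the charset on a single byte can be reproduced without changing the platform default, by passing the charset explicitly. This is only a sketch: the class and method names are hypothetical, and UTF-16BE stands in for a platform whose default is a 2-byte encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SingleByteDecoding {
    // Decodes exactly one byte with an explicitly chosen charset.
    static String decodeOneByte(byte b, Charset charset) {
        return new String(new byte[] {b}, charset);
    }

    public static void main(String[] args) {
        byte letterA = 0x41; // 'A' in ASCII, ISO-8859-1 and UTF-8

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(decodeOneByte(letterA, StandardCharsets.ISO_8859_1));
        System.out.println(decodeOneByte(letterA, StandardCharsets.UTF_8));

        // One byte is only half of a UTF-16 code unit, so the decoder
        // substitutes the replacement character U+FFFD.
        String broken = decodeOneByte(letterA, StandardCharsets.UTF_16BE);
        System.out.println("replacement character: " + broken.equals("\uFFFD"));
    }
}
```

With the two single-byte-friendly charsets the byte decodes to A; with UTF-16BE the String constructor cannot form a code unit from one byte and silently replaces it.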
That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String, between InputStream and Reader, or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, your code will be platform-dependent.
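As a rough illustration of specifying the encoding explicitly, the sketch below assumes the data is UTF-8 (an assumption for the example; use whatever encoding your file really has). The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // byte[] -> String with a named charset instead of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // InputStream -> Reader with a named charset; returns the first line.
    static String readFirstLine(InputStream in) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // String -> byte[] with a named charset: "café" is 5 bytes in UTF-8.
        byte[] utf8 = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes: " + utf8.length);
        System.out.println("chars: " + decodeUtf8(utf8).length());
        System.out.println("round trip ok: "
                + readFirstLine(new ByteArrayInputStream(utf8)).equals("caf\u00E9"));
    }
}
```

Because every conversion names its charset, the result is the same on any platform, regardless of the default encoding.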
(*) That's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.
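The distinction in the footnote can be checked with a small sketch. The helper names are invented for the example; the two sample strings are an emoji outside the Basic Multilingual Plane (U+1F600) and the letter e followed by a combining acute accent (U+0301).

```java
class CodeUnitsAndCodePoints {
    // Number of UTF-16 code units, i.e. the number of char values.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // One code point that needs two code units (a surrogate pair).
        String emoji = new String(Character.toChars(0x1F600));
        // Two code points that a reader perceives as one character.
        String accented = "e\u0301";

        System.out.println("emoji: " + codeUnits(emoji) + " code units, "
                + codePoints(emoji) + " code point(s)");
        System.out.println("accented e: " + codeUnits(accented) + " code units, "
                + codePoints(accented) + " code point(s)");
    }
}
```

The emoji is one code point stored in two char values, while the accented e is a single visible character built from two code points.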