Which encoding does Java use: UTF-8 or UTF-16?

Question

I have read the following posts:

  1. What is Java's internal representation for String? Modified UTF-8? UTF-16?
  2. https://docs.oracle.com/javase/8/docs/api/java/lang/String.html

Now consider the code given below:

public static void main(String[] args) {
    printCharacterDetails("最");
}

public static void printCharacterDetails(String character){
    System.out.println("Unicode Value for "+character+"="+Integer.toHexString(character.codePointAt(0)));
    byte[] bytes = character.getBytes();
    System.out.println("The UTF-8 Character="+character+"  | Default: Number of Bytes="+bytes.length);
    String stringUTF16 = new String(bytes, StandardCharsets.UTF_16);
    System.out.println("The corresponding UTF-16 Character="+stringUTF16+"  | UTF-16: Number of Bytes="+stringUTF16.getBytes().length);
    System.out.println("----------------------------------------------------------------------------------------");
}

When I debugged the line character.getBytes() in the code above, the debugger took me into the getBytes() method of the String class and then into the static byte[] encode(char[] ca, int off, int len) method of the StringCoding class. The first line of the encode method (String csn = Charset.defaultCharset().name();) returned "UTF-8" as the default encoding during debugging. I expected it to be "UTF-16".

The output of the program is:

Unicode Value for 最=6700
The UTF-8 Character=最 | Default: Number of Bytes=3
The corresponding UTF-16 Character=� | UTF-16: Number of Bytes=6

When I converted it to UTF-16 explicitly in the program, it took 6 bytes to represent the character. Shouldn't it use 2 or 4 bytes for UTF-16? Why were 6 bytes used?

Where am I going wrong in my understanding? I use Ubuntu 14.04 and the locale command shows the following:

LANG=en_US.UTF-8

Does this mean that the JVM decides which encoding to use on the basis of the underlying OS, or does it use UTF-16 only? Please help me understand the concept.

Recommended answer

Characters are a graphical entity which is part of human culture. When a computer needs to handle text, it uses a representation of those characters in bytes. The exact representation used is called an encoding.

There are many encodings that can represent the same character - either through the Unicode character set, or through other character sets like the various ISO-8859 encodings, or the JIS X 0208.

Internally, Java uses UTF-16. This means that each character is represented by one or two 16-bit code units, that is, by two or four bytes. The character you were using, 最, has the code point U+6700, which is represented in UTF-16 as a single code unit: the byte 0x67 followed by the byte 0x00 (in big-endian order).
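To make the "one or two code units" point concrete, here is a minimal sketch. The class name Utf16CodeUnits and its helper methods are made up for illustration; only the String and Character calls are standard API. It compares the number of char values (UTF-16 code units) with the number of code points for 最 and for a character outside the Basic Multilingual Plane.

```java
// Hypothetical helper class, for illustration only.
class Utf16CodeUnits {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Each code unit as four hex digits, separated by spaces.
    static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bmp = "\u6700";           // 最, U+6700: one code unit
        String astral = "\uD83D\uDE00";  // U+1F600: a surrogate pair, two code units

        System.out.println(hexUnits(bmp) + " -> " + codeUnits(bmp) + " unit(s), "
                + codePoints(bmp) + " code point(s)");
        System.out.println(hexUnits(astral) + " -> " + codeUnits(astral) + " unit(s), "
                + codePoints(astral) + " code point(s)");
    }
}
```

Note that length() counts code units, not characters, which is why the second string reports two units for a single code point.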

That's the internal encoding. You can't see it unless you dump your memory and look at the bytes in the dumped image.

But the method getBytes() does not return this internal representation. Its documentation says:


public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The "platform's default charset" is what your locale variables say it is. That is, UTF-8. So it takes the UTF-16 internal representation, and converts that into a different representation: UTF-8. (Since JDK 18 the default charset is UTF-8 regardless of the locale, unless it is overridden with the file.encoding system property.)
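One way to check this on your own machine is sketched below. The class name DefaultCharsetDemo and the helper sameAsDefault are invented for the example. It prints the default charset and confirms that getBytes() produces the same bytes as getBytes(Charset.defaultCharset()). The printed charset name and the first byte count depend on your platform and JVM settings, so no output is shown for them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class DefaultCharsetDemo {
    // True when the no-argument getBytes() matches an explicit encode
    // with the platform's default charset.
    static boolean sameAsDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        String s = "\u6700"; // 最

        // Platform dependent: "UTF-8" on the asker's Ubuntu setup.
        System.out.println("Default charset: " + Charset.defaultCharset().name());
        System.out.println("getBytes() length: " + s.getBytes().length);

        // Explicit charset: always 3 bytes for U+6700 in UTF-8.
        System.out.println("UTF-8 length: " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("Same as default: " + sameAsDefault(s));
    }
}
```

Passing the charset explicitly, as in the UTF-8 line, removes the dependence on the environment.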

Note that

new String(bytes, StandardCharsets.UTF_16);

does not "convert it to UTF-16 explicitly" as you assumed it does. This string constructor takes a sequence of bytes, which is supposed to be in the encoding that you have given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding.

But you have given it a sequence of bytes encoded in UTF-8, and told it to interpret that as UTF-16. This is wrong, and you do not get the character - or the bytes - that you expect.
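The 6 bytes in your output can be reproduced with the sketch below, which assumes the default charset was UTF-8 as in your setup. The class and method names are hypothetical. The three UTF-8 bytes E6 9C 80 are decoded as UTF-16: the first two bytes become the unrelated character U+E69C, and the leftover byte is malformed input, which the constructor replaces with U+FFFD. Each of those two characters then takes 3 bytes when encoded back to UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class MisdecodeDemo {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Deliberately wrong: interprets arbitrary bytes as UTF-16.
    static String decodeAsUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String original = "\u6700";                 // 最
        byte[] utf8Bytes = utf8(original);          // E6 9C 80

        String wrong = decodeAsUtf16(utf8Bytes);    // not the original character
        System.out.println("Chars after wrong decode: " + wrong.length());
        System.out.println("UTF-8 bytes of wrong string: " + utf8(wrong).length);

        // Decoding with the charset that produced the bytes restores the text.
        String right = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println("Round trip ok: " + right.equals(original));
    }
}
```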

You can't tell Java how to internally store strings. It always stores them as UTF-16. The constructor String(byte[],Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset). And the method getBytes() without an argument does the same - but uses your platform's default character set for the conversion.
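As a rough illustration of how the constructor and getBytes(Charset) pair up, the sketch below encodes a string with a charset and decodes the bytes with the same charset. RoundTripDemo and its methods are names chosen for this example. The round trip succeeds only when the charset can represent the character: ISO-8859-1 cannot encode 最, so the encoder substitutes a replacement byte and the text is lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class RoundTripDemo {
    // Encode with cs, decode with the same cs, and compare with the input.
    static boolean roundTrips(String s, Charset cs) {
        byte[] encoded = s.getBytes(cs);
        return new String(encoded, cs).equals(s);
    }

    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u6700"; // 最
        Charset[] charsets = {
            StandardCharsets.UTF_8,       // 3 bytes, round trip succeeds
            StandardCharsets.UTF_16BE,    // 2 bytes, round trip succeeds
            StandardCharsets.ISO_8859_1   // 1 replacement byte, round trip fails
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + encodedLength(s, cs)
                    + " byte(s), round trip " + (roundTrips(s, cs) ? "ok" : "lost"));
        }
    }
}
```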

So you misunderstood what getBytes() gives you. It's not the internal representation. You can't get that directly. Only getBytes(StandardCharsets.UTF_16) will give you that, and only because you know that UTF-16 is the internal representation in Java. If a future version of Java decided to represent the characters in a different encoding, then getBytes(StandardCharsets.UTF_16) would not show you the internal representation.
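One detail worth seeing if you try this: when encoding, Java's UTF_16 charset writes a big-endian byte-order mark (FE FF) in front of the data, so a single BMP character comes out as 4 bytes rather than 2. The sketch below uses a made-up class Utf16BomDemo with a small hex helper to print the bytes for each UTF-16 variant.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class Utf16BomDemo {
    // Bytes as two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u6700"; // 最

        // UTF_16 prepends a byte-order mark when encoding.
        System.out.println("UTF-16:   " + hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 67 00
        // The BE and LE variants write the code units only.
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 67 00
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE))); // 00 67
    }
}
```

UTF_16BE is therefore the closest match to the "0x67 then 0x00" description, since it omits the mark.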

Edit: In fact, Java 9 introduced just such a change in the internal representation of strings. By default, strings whose characters all fall in the ISO-8859-1 range are internally represented in ISO-8859-1, whereas strings with at least one character outside that range are internally represented in UTF-16 as before. So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.
