Java UTF-8 encoding: char and String types

Question

public class UTF8 {
    public static void main(String[] args){
        String s = "ヨ"; //0xFF6E
        System.out.println(s.getBytes().length);//length of the string
        System.out.println(s.charAt(0));//first character in the string
    }
}

Output:

3
ヨ

Please help me understand this. I am trying to understand how UTF-8 encoding works in Java. According to the Java documentation, the char data type is a single 16-bit Unicode character.

Does this mean that the char type in Java can only hold Unicode characters that can be represented in 2 bytes, and no more than that?

In the program above, the number of bytes reported for the string is 3, yet the third line returns the first character as a char, which is 2 bytes in Java. How can a 2-byte char hold a character that is 3 bytes long? I am really confused here.

Any good references on this concept, in Java or in general, would be much appreciated.

Recommended answer

Nothing in your code example directly uses UTF-8. Java strings use UTF-16 in memory: a String is a sequence of 16-bit char values. A Unicode code point that does not fit in a single 16-bit char is encoded as a pair of chars known as a surrogate pair.
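As a rough illustration of the surrogate pair point, the sketch below compares a character that fits in one char with one that does not. The class name SurrogatePairDemo, its helper methods, and the choice of U+1F600 as the second character are assumptions made for this example; they are not part of the original question.

```java
class SurrogatePairDemo {
    // Number of 16-bit char units in the string (what String.length() reports).
    static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting each surrogate pair once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+FF6E lies in the Basic Multilingual Plane, so one char is enough.
        String bmp = "\uFF6E";
        // U+1F600 lies outside it (hypothetical example character), so it needs two chars.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(charUnits(bmp) + " char(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(charUnits(supplementary) + " char(s), "
                + codePoints(supplementary) + " code point(s)");
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1)));  // true
    }
}
```

So a char is not limited to "characters that take 2 bytes in UTF-8"; it is limited to one UTF-16 code unit, and anything larger is spread over two of them.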

If you do not pass an argument to String.getBytes(), it returns a byte array containing the String contents encoded in the platform's default charset, which can differ from one system to another. To guarantee a UTF-8 encoded array, call getBytes("UTF-8") instead.
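A minimal sketch of the difference, assuming the character in question is U+FF6E as the comment in the question states. It uses the StandardCharsets constants rather than the "UTF-8" string form mentioned above, which avoids the checked UnsupportedEncodingException; the class name GetBytesDemo and the helper encodedLength are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Length in bytes of the string when encoded with an explicit charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\uFF6E";

        // The no-argument form depends on the platform default, so its result can vary.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default bytes:   " + s.getBytes().length);

        // Explicit charsets give the same answer everywhere.
        System.out.println("UTF-8 bytes:     " + encodedLength(s, StandardCharsets.UTF_8));    // 3
        System.out.println("UTF-16BE bytes:  " + encodedLength(s, StandardCharsets.UTF_16BE)); // 2
    }
}
```

UTF_16BE is used here instead of UTF_16 because the plain UTF_16 encoder prepends a byte order mark, which would add two bytes to the count.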

Calling String.charAt() simply returns one UTF-16 code unit (a char) from the String's in-memory storage.
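To make the behaviour of charAt() concrete, here is a small sketch that prints the raw code units it returns and contrasts them with String.codePointAt(), which reassembles a surrogate pair into one code point. The class CharAtDemo, its helpers, and the example character U+1F600 are assumptions introduced for illustration.

```java
class CharAtDemo {
    // Hex value of the single 16-bit code unit that charAt returns at this index.
    static String unitHex(String s, int index) {
        return String.format("0x%04X", (int) s.charAt(index));
    }

    // Hex value of the full code point that starts at this index.
    static String codePointHex(String s, int index) {
        return String.format("0x%04X", s.codePointAt(index));
    }

    public static void main(String[] args) {
        String bmp = "\uFF6E";
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(unitHex(bmp, 0));                // 0xFF6E
        System.out.println(unitHex(supplementary, 0));      // 0xD83D (high surrogate)
        System.out.println(unitHex(supplementary, 1));      // 0xDE00 (low surrogate)
        System.out.println(codePointHex(supplementary, 0)); // 0x1F600
    }
}
```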

So in your example, the Unicode character is stored in the String as a single UTF-16 code unit, which is two bytes (0x6E 0xFF or 0xFF 0x6E, depending on endianness), whereas the byte array returned by getBytes() holds three bytes, encoded in whatever the platform default charset is.

In UTF-8, that particular Unicode character happens to take 3 bytes as well (0xEF 0xBD 0xAE).
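The byte values quoted above can be checked with a short hex dump. This is only a sketch: the class HexDumpDemo and its hex helper are hypothetical names, and the character is written as the escape \uFF6E on the assumption that the code point given in the question's comment is the intended one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HexDumpDemo {
    // Encodes the string with the given charset and formats each byte as 0xNN.
    static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uFF6E";
        System.out.println(hex(s, StandardCharsets.UTF_8));    // 0xEF 0xBD 0xAE
        System.out.println(hex(s, StandardCharsets.UTF_16BE)); // 0xFF 0x6E
        System.out.println(hex(s, StandardCharsets.UTF_16LE)); // 0x6E 0xFF
    }
}
```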
