Encodings: different results from codePointCount and length


Question

I found a tricky case and couldn't find an explanation for why it happens.

The main question is how long the string is: does it contain one character or two?

Code:

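public class App {
    public static void main(String[] args) throws Exception {
        char ch0 = 55378; // 0xD852, a high (leading) surrogate
        char ch1 = 56816; // 0xDDF0, a low (trailing) surrogate
        String str = new String(new char[]{ch0, ch1});
        System.out.println(str);                      // the whole string
        System.out.println(str.length());             // number of char values
        System.out.println(str.codePointCount(0, 2)); // number of code points
        System.out.println(str.charAt(0));            // first char on its own
        System.out.println(str.charAt(1));            // second char on its own
    }
}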

Output:

?
2
1
?
?

Any suggestions?

Recommended answer

"Whether it contains one or two characters."

It contains one Unicode character, which is made up of 2 UTF-16 code units. Every char in Java is a UTF-16 code unit, so a single char may not be a whole character. Each character has a single code point: Unicode provides a coded character set that maps each character to an integer representing that character (the code point).
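To make the relationship between the two code units and the single code point concrete, here is a minimal sketch using the two char values from the question. The class name SurrogatePairDemo and its helper methods are made up for illustration; Character.isSurrogatePair, Character.isHighSurrogate, Character.isLowSurrogate and Character.toCodePoint are standard JDK methods. The manual version follows the UTF-16 surrogate arithmetic and is shown only to illustrate what the library call computes.

```java
public class SurrogatePairDemo {
    // Combines a high and a low surrogate into the code point they encode,
    // using the JDK helper. (Hypothetical wrapper, for illustration only.)
    public static int toCodePoint(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    // The same calculation done by hand, following the UTF-16 definition:
    // each surrogate contributes 10 bits, offset by 0x10000.
    public static int toCodePointManually(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char ch0 = 55378; // 0xD852
        char ch1 = 56816; // 0xDDF0

        System.out.println(Character.isHighSurrogate(ch0)); // true
        System.out.println(Character.isLowSurrogate(ch1));  // true

        int cp = toCodePoint(ch0, ch1);
        System.out.println(Integer.toHexString(cp));            // 249f0
        System.out.println(cp == toCodePointManually(ch0, ch1)); // true
    }
}
```

So the two chars in the question are the two halves of one supplementary character, U+249F0, and neither half is a character on its own.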

length() returns the number of code units, whereas codePointCount returns the number of code points.
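The difference between the two counts can be sketched as follows. The class name CodePointCountDemo and its two helper methods are hypothetical; String.codePoints() requires Java 8 or later. As for the "?" lines in the question's output: charAt returns each surrogate separately, and a lone surrogate is not a printable character, while the "?" for the whole string most likely means the console encoding or font could not represent the character (an assumption, since the question does not describe the environment).

```java
public class CodePointCountDemo {
    // Number of UTF-16 code units, which is what length() reports.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points over the whole string.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The same two chars as in the question (55378 and 56816).
        String str = "\uD852\uDDF0";

        System.out.println(codeUnits(str));  // 2
        System.out.println(codePoints(str)); // 1

        // Iterate by code point rather than by char.
        str.codePoints().forEach(cp ->
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                + " uses " + Character.charCount(cp) + " char(s)"));
        // prints: U+249F0 uses 2 char(s)

        // Mixed content: two one-unit characters around the pair.
        String mixed = "a" + str + "b";
        System.out.println(codeUnits(mixed));  // 4
        System.out.println(codePoints(mixed)); // 3
    }
}
```

When code needs to process user-visible characters, iterating with codePoints() (or codePointAt plus Character.charCount) avoids splitting a surrogate pair the way charAt does.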

You may want to look at my article about encodings in .NET. The terminology all carries over (it is standard terminology), so just ignore the .NET-specific parts.
