Clarifying Java's evolutionary support of Unicode

Question

I'm finding Java's differentiation of char and codepoint to be strange and out of place.

For example, a string is an array of characters, or "letters which appear in an alphabet", in contrast to a codepoint, which may be a single letter or possibly a composite or a surrogate pair. However, Java defines a character of a string as a char, which cannot be a composite or contain a surrogate pair, and a codepoint as an int (this is fine).

But then length() seems to return the number of codepoints, while codePointCount() also returns the number of codepoints but combines composite characters, which ends up not really being the real count of codepoints?

It feels as though charAt() should return a String so that composites and surrogates are brought along and the result of length() should swap with codePointCount().

The original implementation feels a little backwards. Is there a reason it's designed the way it is?

Update: codePointAt(), codePointBefore()

It's also worth noting that codePointAt() and codePointBefore() accept an index as a parameter. However, the index acts upon chars and has a range of 0 to length() - 1 (1 to length() for codePointBefore()), so it is not based on the number of codepoints in the string, as one might assume.
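
A minimal sketch of that indexing behaviour, assuming a sample string that contains one supplementary character. The class and method names (CodePointIndexDemo, codePoints, nthCodePoint) are illustrative only and not part of the original post; the String and Character calls are standard JDK methods.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointIndexDemo {

    // Collects the code points of s by stepping through char indices.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // i is a char (UTF-16 code unit) index
            result.add(cp);
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
        return result;
    }

    // Returns the n-th code point (0-based) by first converting n to a char index.
    static int nthCodePoint(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 encoded as a surrogate pair, then 'b': four chars, three code points
        String s = "a\uD83D\uDE00b";
        System.out.println(codePoints(s));
        // Char index 2 lands on the low surrogate, so only that half is returned
        System.out.println(Integer.toHexString(s.codePointAt(2)));
        // Asking for the third code point requires translating the index first
        System.out.println(Integer.toHexString(nthCodePoint(s, 2)));
    }
}
```

In this sketch, offsetByCodePoints() is what bridges the two index spaces: it turns "the n-th code point" into the char index that codePointAt() expects.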

Update: equalsIgnoreCase()

String.equalsIgnoreCase() uses the term normalization to describe what it does prior to comparing strings. This is a misnomer as normalization in the context of a Unicode string can mean something entirely different. What they mean to say is that they use case-folding.
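
To illustrate the distinction, here is a small sketch contrasting a case-insensitive comparison with Unicode normalization via java.text.Normalizer. The class name and the equalsAfterNfc helper are hypothetical names chosen for this example, and NFC is just one of the available normalization forms.

```java
import java.text.Normalizer;

final class CaseVsNormalizationDemo {

    // Compares two strings after bringing both into Unicode normalization form NFC.
    static boolean equalsAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // e with acute accent as a single code point
        String decomposed = "e\u0301";  // plain 'e' followed by a combining acute accent

        // Case differences are ignored
        System.out.println("UNICODE".equalsIgnoreCase("unicode"));
        // Canonically equivalent spellings are not unified: the char lengths already differ
        System.out.println(precomposed.equalsIgnoreCase(decomposed));
        // Normalization in the Unicode sense is what makes the two spellings equal
        System.out.println(equalsAfterNfc(precomposed, decomposed));
    }
}
```

Note that equalsIgnoreCase() compares char by char, so strings of different char lengths never match; it is closer to a simple per-char case mapping than to full Unicode case folding.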

Recommended answer

When Java was created, Unicode didn't have the notion of surrogate characters, and Java decided to represent characters as 16-bit values.
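
As a rough sketch of what that 16-bit choice means in practice: length() counts chars (UTF-16 code units), while codePointCount() counts code points, and neither merges a base letter with a combining mark. The class and helper names below are made up for this illustration; the sample strings are assumptions, not from the original post.

```java
final class CharVsCodePointDemo {

    // Number of 16-bit chars (UTF-16 code units) in s.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in s; a surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the 16-bit range, so it is stored as a surrogate pair
        String supplementary = new String(Character.toChars(0x1F600));
        // 'e' plus a combining acute accent: two code points, each fitting in one char
        String composite = "e\u0301";

        System.out.println(charCount(supplementary) + " chars, "
                + codePointCount(supplementary) + " code point(s)");
        System.out.println(charCount(composite) + " chars, "
                + codePointCount(composite) + " code point(s)");
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));
    }
}
```

So a supplementary character gives 2 chars but 1 code point, while the composite gives 2 of each; codePointCount() does not combine composite characters.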

I suppose they don't want to break backwards compatibility. There is a lot more information here: http://www.oracle.com/us/technologies/java/supplementary-142654.html
