Separating Unicode ligature characters


Question

Among the vast number of Unicode characters, there are some that actually represent more than one character, like the ligature U+FB00 (ﬀ), which stands for two 'f' characters. Is there an easy way to convert characters like these into multiple single characters? Preferably something available in the standard Java API, but I can use an external library if need be.

Recommended answer

U+FB00 is a compatibility character. Normally Unicode doesn't provide separate code points for ligatures, on the grounds that whether and when to use a ligature is a layout decision that should not influence how the data is stored. A few ligature code points still exist to allow round-trip conversion with older encodings that do represent ligatures as separate entities.

Luckily, the information about which characters a ligature represents is present in the Unicode data files, and most capable string handling systems have that data built in.

In Java, you'll need to use the Normalizer class (java.text.Normalizer) with the NFKC form:

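import java.text.Normalizer;
import java.text.Normalizer.Form;

String ff = "\uFB00"; // U+FB00, LATIN SMALL LIGATURE FF
String normalized = Normalizer.normalize(ff, Form.NFKC); // compatibility decomposition, then composition
System.out.println(ff + " = " + normalized);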

This will print:

ﬀ = ff
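
One thing to keep in mind when applying this to whole strings: NFKC replaces every compatibility character, not just ligatures (superscript two, U+00B2, becomes a plain 2, for example). If only the ligatures should be split, one possible approach is to normalize just the characters in the Latin ligature range U+FB00..U+FB06. The following is a minimal sketch under that assumption; the class and method names (LigatureExpander, expandLatinLigatures) are made up for illustration and are not part of the Java API.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class LigatureExpander {

    // Hypothetical helper, not part of the Java API: expands only the Latin
    // ligatures in the range U+FB00..U+FB06 and copies everything else as-is.
    public static String expandLatinLigatures(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= '\uFB00' && c <= '\uFB06') {
                // Let the built-in Unicode data supply the replacement letters.
                out.append(Normalizer.normalize(String.valueOf(c), Form.NFKC));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "office" written with the ffi ligature, followed by " m" and superscript two
        String text = "o\uFB03ce m\u00B2";
        // Selective expansion: the ligature is split, the superscript is kept.
        System.out.println(expandLatinLigatures(text));
        // Plain NFKC: the ligature is split and the superscript is folded to '2'.
        System.out.println(Normalizer.normalize(text, Form.NFKC));
    }
}
```

Whether the selective version or plain NFKC is the better fit depends on the use case: for search or comparison, folding all compatibility characters is often what you want, while for text that will be displayed again it may be preferable to touch only the ligatures.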
