How to convert UTF-8 combined characters into single UTF-8 characters in Ruby?

Question

Some characters, such as the Unicode character 'LATIN SMALL LETTER C WITH CARON', can be encoded as 0xC4 0x8D, but can also be represented with the two code points for 'LATIN SMALL LETTER C' and 'COMBINING CARON', which in UTF-8 is 0x63 0xCC 0x8C.
More info here: http://www.fileformat.info/info/unicode/char/10d/index.htm
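
To make the two representations concrete, here is a minimal sketch that builds both strings and prints their UTF-8 bytes. The variable names and the `hex` helper are illustrative only and do not come from any library.

```ruby
# Two ways to write the same visible character.
precomposed = "\u010D"   # U+010D LATIN SMALL LETTER C WITH CARON
combining   = "c\u030C"  # U+0063 'c' followed by U+030C COMBINING CARON

# Hypothetical helper: format a string's UTF-8 bytes as hex.
hex = ->(s) { s.bytes.map { |b| format("0x%02X", b) }.join(" ") }

puts hex.call(precomposed)     # 0xC4 0x8D
puts hex.call(combining)       # 0x63 0xCC 0x8C
puts precomposed.length        # 1 (one code point)
puts combining.length          # 2 (two code points)
puts precomposed == combining  # false: they differ byte for byte
```

Both strings render as the same glyph, yet a plain string comparison treats them as different, which is why a conversion step is needed.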

I wonder if there is a library which can convert a 'LATIN SMALL LETTER C' + 'COMBINING CARON' into 'LATIN SMALL LETTER C WITH CARON'. Or is there a table containing these conversions?

Recommended answer

Generally, you use Unicode Normalization to do this.

Using UnicodeUtils.nfkc from the unicode_utils gem (https://github.com/lang/unicode_utils) should get you the specific behavior you're asking for. Unicode Normalization Form KC (NFKC) applies a compatibility decomposition and then recomposes the string into precomposed characters where they exist, which is basically what your example asks for. (You may also get close to what you want with Normalization Form C, abbreviated NFC.)
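
As a rough sketch of the composition step, the example below uses Ruby's built-in `String#unicode_normalize` in place of `UnicodeUtils.nfkc`. This is a substitution, not the gem named in the answer, and it assumes Ruby 2.2 or later, where that method is part of the standard library.

```ruby
# Assumption: Ruby 2.2+, where String#unicode_normalize is built in.
# This stands in for UnicodeUtils.nfkc from the unicode_utils gem.
combining = "c\u030C"  # 'c' followed by U+030C COMBINING CARON

nfc  = combining.unicode_normalize(:nfc)   # canonical composition
nfkc = combining.unicode_normalize(:nfkc)  # compatibility decomposition, then composition

# Show the resulting code points.
puts nfc.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")  # U+010D
puts nfc == "\u010D"   # true
puts nfkc == "\u010D"  # true
```

For this particular character both forms give the same single code point; they only diverge on characters that have compatibility decompositions.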

The question "How to replace the Unicode gem on Ruby 1.9?" has additional details.

In Ruby 1.8.7, you'd need to run gem install unicode, which provides a similar function.

Edited to add: the main reason you'll probably want Normalization Form KC rather than just Normalization Form C is that ligatures (characters that are squeezed together for historical or typographical reasons) are first decomposed into their individual characters, which is sometimes desirable if you're doing lexicographic ordering or searching.
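
The ligature difference can be seen in a small sketch. It again assumes Ruby 2.2+ and the built-in `String#unicode_normalize` rather than the unicode_utils gem, and the sample word is made up for illustration.

```ruby
# Assumption: Ruby 2.2+ with the built-in String#unicode_normalize.
ligature = "\uFB01sh"  # "fish" written with U+FB01 LATIN SMALL LIGATURE FI

nfc  = ligature.unicode_normalize(:nfc)   # leaves the ligature in place
nfkc = ligature.unicode_normalize(:nfkc)  # expands the ligature to 'f' + 'i'

puts nfc == ligature      # true
puts nfkc == "fish"       # true
puts nfc.include?("fi")   # false: a plain-text search misses the ligature
puts nfkc.include?("fi")  # true
```

Because NFKC discards the distinction between the ligature and its letters, it suits searching and sorting better than text that must round-trip exactly.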
