Unicode string with diacritics split by chars

Problem description

I have this Unicode string: Ааа́Ббб́Ввв́ГгҐґДд

I want to split it by chars. Right now, if I try to loop through all chars, I get something like this:
A a a ' Б ...

Is there a way to properly split this string into chars: А а а́ ?

Recommended answer

To do this properly, what you want is the algorithm for working out the grapheme cluster boundaries, as defined in UAX 29. Unfortunately this requires knowledge of which characters are members of which classes, from the Unicode Character Database, and JavaScript doesn't make that information available(*). So you'd have to include a copy of the UCD with your script, which would make it pretty bulky.
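Runtimes that implement `Intl.Segmenter` can do this segmentation without a bundled copy of the UCD. `Intl.Segmenter` is not mentioned in the answer above, so treat the following as a minimal sketch that assumes it is available in your environment; `splitGraphemes` is a hypothetical helper name.

```javascript
// Sketch: split a string into grapheme clusters with Intl.Segmenter.
// Assumes a runtime that implements Intl.Segmenter; splitGraphemes is a
// hypothetical helper name, not something from the original answer.
function splitGraphemes(s) {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    // segment() returns an iterable of { segment, index, input } records.
    return Array.from(segmenter.segment(s), (part) => part.segment);
}
```

Called on the string from the question, this should keep each combining acute accent attached to the letter before it instead of returning it as a separate element.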

An alternative if you only need to worry about the basic accents used by Latin or Cyrillic would be to take only the Combining Diacritical Marks block (U+0300-U+036F). This would fail for other languages and symbols, but might be enough for what you want to do.

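function findGraphemesNotVeryWell(s) {
    // One character followed by zero or more combining diacritical marks
    // (U+0300-U+036F). Note that `.` does not match line terminators.
    var re = /.[\u0300-\u036F]*/g;
    var match, matches = [];
    while ((match = re.exec(s)) !== null)
        matches.push(match[0]);
    return matches;
}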

findGraphemesNotVeryWell('Ааа́Ббб́Ввв́ГгҐґДд');
["А", "а", "а́", "Б", "б", "б́", "В", "в", "в́", "Г", "г", "Ґ", "ґ", "Д", "д"]
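The same base-plus-marks idea can be widened beyond the single U+0300-U+036F block in engines that support Unicode property escapes (`\p{...}` with the `u` flag). This is an assumption about the runtime and not part of the original answer. It is still only an approximation of UAX 29: it groups marks with their base character but knows nothing about, for example, Hangul jamo or emoji sequences. `findGraphemesWithMarks` is a hypothetical name.

```javascript
// Sketch: group each base code point with any combining marks after it.
// Assumes the engine supports Unicode property escapes (the `u` flag).
function findGraphemesWithMarks(s) {
    // \P{M}\p{M}* : one non-mark code point plus any trailing marks.
    // \p{M}+      : a run of marks with no base character before it.
    const re = /\P{M}\p{M}*|\p{M}+/gu;
    // match() returns null when nothing matches, so fall back to [].
    return s.match(re) || [];
}
```

Because of the `u` flag the pattern works on code points, so a character outside the Basic Multilingual Plane comes back as one element instead of two surrogate halves, and `\P{M}` also matches line terminators, which `.` does not.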

(*: there might be a way to extract the information by letting the browser render the string, and measuring the positions of selections in it... but it would surely be very messy and difficult to get working cross-browser.)
