Unicode string split by chars


Problem description


I have this Unicode string: Ааа́Ббб́Ввв́ГгҐґДд

I want to split it into characters. Right now, if I loop through all the chars, I get something like this:
A a a ' Б ...

Is there a way to split this string properly into characters, like this: А а а́ ...?

Solution

To do this properly, what you want is the algorithm for working out the grapheme cluster boundaries, as defined in UAX 29. Unfortunately this requires knowledge of which characters are members of which classes, from the Unicode Character Database, and JavaScript doesn't make that information available(*). So you'd have to include a copy of the UCD with your script, which would make it pretty bulky.
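Newer JavaScript runtimes expose UAX 29 segmentation through Intl.Segmenter, so where that API is available the UCD does not have to be bundled with the script. The following is a minimal sketch that assumes the runtime supports Intl.Segmenter; the function name splitGraphemes is made up for illustration, and the exact cluster boundaries depend on the Unicode data shipped with the engine.

```javascript
// Hypothetical helper (not from the original answer): split a string into
// grapheme clusters using the engine's built-in UAX 29 segmentation.
function splitGraphemes(s) {
    // Assumption: Intl.Segmenter exists in this runtime. Fail loudly if not.
    if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
        throw new Error('Intl.Segmenter is not available in this runtime');
    }
    // undefined locale means "use the default locale".
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    // segment() returns an iterable of objects; each has a `segment` string.
    return Array.from(segmenter.segment(s), (part) => part.segment);
}

// \u0301 is COMBINING ACUTE ACCENT, written as an escape so it stays visible.
console.log(splitGraphemes('Ааа\u0301Ббб\u0301'));
// ["А", "а", "а́", "Б", "б", "б́"]
```

If the API is missing, the function throws instead of silently falling back, so the caller can decide whether a cruder approach is acceptable.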

An alternative if you only need to worry about the basic accents used by Latin or Cyrillic would be to take only the Combining Diacritical Marks block (U+0300-U+036F). This would fail for other languages and symbols, but might be enough for what you want to do.

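function findGraphemesNotVeryWell(s) {
    // One character (any except a line terminator), followed by zero or more
    // marks from the Combining Diacritical Marks block (U+0300-U+036F).
    var re = /.[\u0300-\u036F]*/g;
    var match, matches = [];
    // With the g flag, exec() resumes after the previous match and
    // returns null once the string is exhausted.
    while ((match = re.exec(s)) !== null)
        matches.push(match[0]);
    return matches;
}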

findGraphemesNotVeryWell('Ааа́Ббб́Ввв́ГгҐґДд');
["А", "а", "а́", "Б", "б", "б́", "В", "в", "в́", "Г", "г", "Ґ", "ґ", "Д", "д"]
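A variation on the function above, for engines that support Unicode property escapes in regular expressions (the u flag with \p{...}): instead of hard-coding one block, it attaches every code point in the general category Mark to the preceding base character. This is a sketch, not a UAX 29 implementation; it still splits things like emoji joined with ZWJ or Hangul jamo sequences, and the name findGraphemesWithMarks is invented here for illustration.

```javascript
// Hypothetical helper (not from the original answer).
function findGraphemesWithMarks(s) {
    // \P{M}   one code point that is NOT a combining mark
    // \p{M}*  any combining marks that follow it
    // \p{M}+  fallback so stray marks at the start of the string are kept
    // The u flag makes the regex work on code points, not UTF-16 code units,
    // so characters outside the BMP are not cut in half.
    return s.match(/\P{M}\p{M}*|\p{M}+/gu) || [];
}

console.log(findGraphemesWithMarks('Ааа\u0301Ббб\u0301'));
// ["А", "а", "а́", "Б", "б", "б́"]
```

Compared with the U+0300-U+036F range, this also covers marks from other blocks, such as U+20DD COMBINING ENCLOSING CIRCLE, at the cost of requiring a newer engine.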

(*: there might be a way to extract the information by letting the browser render the string, and measuring the positions of selections in it... but it would surely be very messy and difficult to get working cross-browser.)
