How to get all characters within a certain UTF-8 language group?


Question

    I don't know the exact technical terminology, but UTF-8 as a standard includes characters from certain language groupings, which can be observed in the Windows Character Map with a font like Arial Unicode MS.

    • Latin
    • Cyrillic
    • Greek
    • Hebrew
    • Arabic
    • Devanagari
    • Gujarati
    • Kannada
    • Lao
    • Hiragana
    • Currency Symbols
    • Box Drawings

    How do I obtain a list of the characters under each set? This could be an API or just a plain list/DB somewhere on the net. I found the wiki article that lists everything, but not in an iterable form. Any ideas?

    Solution

    You can get the entire list of Unicode characters from the published UnicodeData.txt, a semicolon-delimited text file that lists every character together with its name and category information.
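As a rough illustration of reading that file, here is a minimal Python sketch that splits each line on semicolons and keeps the first three fields (code point, name, general category). The `SAMPLE` text, the `CharRecord` type and the `parse_unicode_data` function are assumptions made for this example; a real program would read the downloaded UnicodeData.txt instead of the embedded sample.

```python
from typing import List, NamedTuple

# A few sample lines in the UnicodeData.txt format, embedded here for
# illustration only. The real file has far more lines.
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0410;CYRILLIC CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0430;
"""


class CharRecord(NamedTuple):
    """Hypothetical record holding the first three fields of a line."""
    code_point: int
    name: str
    category: str


def parse_unicode_data(text: str) -> List[CharRecord]:
    """Parse semicolon-delimited lines into CharRecord tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(";")
        # Field 0 is the code point in hex, field 1 the name,
        # field 2 the two-letter general category.
        records.append(CharRecord(int(fields[0], 16), fields[1], fields[2]))
    return records


records = parse_unicode_data(SAMPLE)
for rec in records:
    print(f"U+{rec.code_point:04X} {rec.category} {rec.name}")
```

The remaining fields (combining class, bidirectional class, case mappings and so on) are ignored here because only the name and category matter for grouping.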

    Grouping by class

    The third field specifies the character's general category as a two-letter abbreviation; the corresponding long names are listed in the Unicode Character Database documentation (UAX #44).

    • letter-character -- classes Lu, Ll, Lt, Lm, Lo, or Nl
    • combining-character -- classes Mn or Mc
    • decimal-digit-character -- class Nd
    • connecting-character -- class Pc
    • formatting-character -- class Cf
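The class list above can be turned into a small lookup table. The following Python sketch uses the standard `unicodedata.category` function to get the two-letter category of a character; the `CLASSES` table and the `classify` helper are names invented for this example.

```python
import unicodedata
from typing import Optional

# Mapping from the class names listed above to their two-letter categories.
CLASSES = {
    "letter-character": {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"},
    "combining-character": {"Mn", "Mc"},
    "decimal-digit-character": {"Nd"},
    "connecting-character": {"Pc"},
    "formatting-character": {"Cf"},
}


def classify(ch: str) -> Optional[str]:
    """Return the class name for a single character, or None if the
    character's general category is not in any of the listed classes."""
    category = unicodedata.category(ch)
    for class_name, categories in CLASSES.items():
        if category in categories:
            return class_name
    return None


for ch in ["A", "7", "_", "$"]:
    print(repr(ch), unicodedata.category(ch), classify(ch))
```

Characters whose category is outside the table, such as the dollar sign (category Sc), come back as `None`.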

    It's even possible to iterate through the characters of a given category using C# LINQ:

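    using System;
    using System.Globalization;
    using System.Linq;

    // Enumerate every code point except the surrogate range, convert each
    // to a string, and group the strings by Unicode general category.
    var charInfo = Enumerable.Range(0, 0x110000)
                             .Where(x => x < 0x00d800 || x > 0x00dfff)
                             .Select(char.ConvertFromUtf32)
                             .GroupBy(s => char.GetUnicodeCategory(s, 0))
                             .ToDictionary(g => g.Key);

    // Print every character in the lowercase-letter (Ll) category.
    foreach (var ch in charInfo[UnicodeCategory.LowercaseLetter])
    {
        Console.Write(ch);
    }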
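For readers not working in .NET, roughly the same idea can be sketched in Python with the standard `unicodedata` module. This is an approximation rather than a line-for-line port: the `by_category` name is made up for this example, and the categories come from whichever Unicode version ships with the Python interpreter.

```python
import unicodedata
from collections import defaultdict

# Group every code point (skipping the surrogate range, as the C# version
# does) by its two-letter general category.
by_category = defaultdict(list)
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue
    ch = chr(cp)
    by_category[unicodedata.category(ch)].append(ch)

# Show the first few lowercase letters (category Ll) rather than all of them.
print("".join(by_category["Ll"][:26]))
```

Unassigned code points end up under the category `Cn`, so filter that key out if only assigned characters are wanted.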
    

    Grouping by language

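    However, UnicodeData.txt does not state the language grouping explicitly, so with this file you'll have to parse the first word of the name to group each character by language. This works as a heuristic because the names of most Latin characters begin with the prefix "LATIN", but it is not exact: some names carry a different first word (for example FULLWIDTH LATIN CAPITAL LETTER A). The Unicode Character Database also publishes Scripts.txt and Blocks.txt, which assign scripts and blocks to code point ranges explicitly. Examples follow: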

    • Basic Latin: LATIN CAPITAL LETTER A (U+0041)
    • Latin Extended-A: LATIN SMALL LETTER C WITH ACUTE (U+0107)
    • Latin Extended-B: LATIN CAPITAL LETTER TONE SIX (U+0184)
    • Latin Extended Additional: LATIN CAPITAL LETTER B WITH DOT ABOVE (U+1E02)
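The name-prefix approach can be sketched in Python as follows. It takes the names from the standard `unicodedata.name` function rather than from a parsed UnicodeData.txt, which is an assumption made to keep the example self-contained; `group_by_name_prefix` is a hypothetical helper name.

```python
import unicodedata
from collections import defaultdict


def group_by_name_prefix(code_points):
    """Group characters by the first word of their Unicode name.
    Code points without a name (controls, unassigned) are skipped."""
    groups = defaultdict(list)
    for cp in code_points:
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" when the character has no name
        if name:
            groups[name.split()[0]].append(ch)
    return groups


# Restrict the scan to U+0000..U+04FF, which covers the Latin, Greek and
# Cyrillic blocks, to keep the output small.
groups = group_by_name_prefix(range(0x0000, 0x0500))
for prefix in ("LATIN", "GREEK", "CYRILLIC"):
    print(prefix, len(groups[prefix]), "".join(groups[prefix][:10]))
```

Because the key is simply the first word of the name, the result also contains groups such as DIGIT or COMBINING that are not languages, so the keys usually need filtering against the list of scripts of interest.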
