Cross-platform iteration of Unicode string (counting Graphemes using ICU)

Problem description

I want to iterate over each character of a Unicode string, treating each surrogate pair and each combining character sequence as a single unit (one grapheme).

The text "नमस्ते" consists of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D (the virama sign) and U+0947 (a vowel sign) are combining marks.

using System;
using System.Globalization;

class Program
{
    static void Main(string[] args)
    {
        const string s = "नमस्ते";

        Console.WriteLine(s.Length); // Outputs "6" (UTF-16 code units)

        // Count text elements (graphemes) instead of code units.
        var l = 0;
        var e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext()) l++;
        Console.WriteLine(l); // Outputs "4"
    }
}

So there we have it in .NET. We also have Win32's CharNextW():

#include <Windows.h>
#include <iostream>
#include <string>

int main()
{
    const wchar_t * s = L"नमस्ते";

    std::cout << std::wstring(s).length() << std::endl; // Gives "6"

    int l = 0;
    for (;;)
    {
        // CharNextW returns its argument unchanged at the terminating null.
        const wchar_t * next = CharNextW(s);
        if (next == s) break;
        s = next;
        ++l;
    }

    std::cout << l << std::endl; // Gives "4"

    return 0;
}



Question



Both ways I know of are specific to Microsoft. Are there portable ways to do it?


  • I heard about ICU, but I couldn't quickly find anything related (UnicodeString(s).length() still gives 6). Pointing to the relevant function/module in ICU would be an acceptable answer.
  • C++ doesn't have a notion of Unicode, so a lightweight cross-platform library for dealing with these issues would also make an acceptable answer.

@McDowell gave the hint to use BreakIterator from ICU, which I think can be regarded as the de facto cross-platform standard for dealing with Unicode. Here is some example code to demonstrate its use (since examples are surprisingly rare):

#include <unicode/brkiter.h>
#include <unicode/schriter.h>
#include <unicode/unistr.h>

#include <iostream>
#include <cassert>
#include <memory>

int main()
{
    // A char16_t literal is portable (assumes ICU 59 or later); the wchar_t
    // constructor is only available where wchar_t is 16 bits wide, as on Windows.
    const icu::UnicodeString str(u"नमस्ते");

    {
        // StringCharacterIterator doesn't seem to recognize graphemes
        icu::StringCharacterIterator iter(str);
        int count = 0;
        while(iter.hasNext())
        {
            ++count;
            iter.next();
        }
        std::cout << count << std::endl; // Gives "6"
    }

    {
        // BreakIterator works!!
        UErrorCode err = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> iter(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), err));
        assert(U_SUCCESS(err));
        iter->setText(str);

        // Each call to next() advances to the following grapheme boundary.
        int count = 0;
        while(iter->next() != icu::BreakIterator::DONE) ++count;
        std::cout << count << std::endl; // Gives "4"
    }

    return 0;
}


Recommended answer

You should be able to use the ICU BreakIterator for this (the character instance, assuming it is feature-equivalent to the Java version).
