Cross-platform iteration of Unicode string (counting Graphemes using ICU)

Views: 259 | Published: 2016/10/13 12:14:25 | Tags: c++ unicode cross-platform icu


Problem description

I want to iterate over each character of a Unicode string, treating each surrogate pair and each combining character sequence as a single unit (one grapheme).

The text "नमस्ते" consists of the code points U+0928, U+092E, U+0938, U+094D, U+0924, U+0947, of which U+094D and U+0947 are combining marks.

    using System;
    using System.Globalization;

    class Program
    {
        static void Main(string[] args)
        {
            const string s = "नमस्ते";
            Console.WriteLine(s.Length); // Outputs "6"

            // Count text elements (graphemes) instead of UTF-16 code units.
            var l = 0;
            var e = StringInfo.GetTextElementEnumerator(s);
            while (e.MoveNext()) l++;
            Console.WriteLine(l); // Outputs "4"
        }
    }

So there we have it in .NET. We also have Win32's CharNextW():

    #include <Windows.h>
    #include <iostream>
    #include <string>

    int main()
    {
        const wchar_t* s = L"नमस्ते";
        std::cout << std::wstring(s).length() << std::endl; // Gives "6"

        // CharNextW returns its argument unchanged at the terminating null.
        int l = 0;
        while (CharNextW(s) != s)
        {
            s = CharNextW(s);
            ++l;
        }
        std::cout << l << std::endl; // Gives "4"
        return 0;
    }

Question

Both ways are, to my knowledge, specific to Microsoft. Are there portable ways to do it?

I heard about ICU, but I couldn't quickly find anything related (UnicodeString(s).length() still gives 6). Pointing to the relevant function/module in ICU would be an acceptable answer.

C++ doesn't have a notion of Unicode, so a lightweight cross-platform library for dealing with these issues would also make an acceptable answer.
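For context on why both std::wstring::length() (on Windows) and UnicodeString::length() report 6: they count UTF-16 code units, not graphemes. The following is a minimal standard-library sketch, not part of the original post, that counts code points by folding surrogate pairs. The helper name countCodePoints is hypothetical. It still does not group combining marks, which is the part that needs a real segmentation library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A high surrogate immediately followed by a low surrogate counts once;
// unpaired surrogates are counted as one unit each.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool isHigh = c >= 0xD800 && c <= 0xDBFF;
        const bool nextIsLow = i + 1 < s.size() &&
                               s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (isHigh && nextIsLow) {
            ++i; // skip the low surrogate of the pair
        }
        ++count;
    }
    return count;
}
```

For the Devanagari text in the question, written as u"\u0928\u092E\u0938\u094D\u0924\u0947", this returns 6, the same as the code unit count, because none of its code points lie outside the Basic Multilingual Plane. For u"\U0001F600" it returns 1 even though the string holds two code units. Neither number is the grapheme count.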

@McDowell gave the hint to use BreakIterator from ICU, which I think can be regarded as the de facto cross-platform standard for dealing with Unicode. Here is some example code demonstrating its use (since examples are surprisingly rare):

    #include <unicode/brkiter.h>
    #include <unicode/schriter.h>
    #include <unicode/unistr.h>
    #include <cassert>
    #include <iostream>
    #include <memory>

    using namespace icu; // newer ICU releases do not import this namespace by default

    int main()
    {
        // u"" (UTF-16) literal instead of L"": wchar_t is not 16 bits wide on every platform
        const UnicodeString str(u"नमस्ते");

        {
            // StringCharacterIterator doesn't seem to recognize graphemes
            StringCharacterIterator iter(str);
            int count = 0;
            while (iter.hasNext())
            {
                ++count;
                iter.next();
            }
            std::cout << count << std::endl; // Gives "6"
        }

        {
            // BreakIterator works!!
            UErrorCode err = U_ZERO_ERROR;
            std::unique_ptr<BreakIterator> iter(
                BreakIterator::createCharacterInstance(Locale::getDefault(), err));
            assert(U_SUCCESS(err));
            iter->setText(str);

            int count = 0;
            while (iter->next() != BreakIterator::DONE) ++count;
            std::cout << count << std::endl; // Gives "4" (author's result; may vary with the ICU/Unicode version)
        }

        return 0;
    }

Recommended answer

You should be able to use the ICU BreakIterator for this (the character instance, assuming it is feature-equivalent to the Java version).
