C++ unicode UTF-16 encoding
Question
I have a wide char string, L"hao123--我的上网主页", and it must be encoded to "hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". I was told that the encoded string uses a special "%uNNNN" format for encoding Unicode UTF-16 code points. This website tells me it's JavaScript escapes. But I don't know how to produce that encoding in C++.
Is there any library to get this to work? Or can you give me some tips?
Thanks, my friends!
Embedding unicode in string literals is generally not a good idea and is not portable; there is no guarantee that wchar_t will be 16 bits and that the encoding will be UTF-16. While this may be the case on Windows with Microsoft Visual C++ (a particular C++ implementation), wchar_t is 32 bits on OS X's GCC (another implementation). If you have some sort of localized string constants, it's best to keep them in a configuration file in some particular encoding and to interpret them as that encoding. The International Components for Unicode (ICU) library provides pretty good support for interpreting and handling unicode. Another good library for converting between (but not interpreting) encoding formats is libiconv.
Edit
It is possible I am misinterpreting your question... if the problem is that you have a string in UTF-16 already, and you want to convert it to "unicode-escape ASCII" (i.e. an ASCII string where unicode characters are represented by "\u" followed by the numeric value of the character), then use the following pseudo-code:
for each codepoint represented by the UTF-16 encoded string:
    if the codepoint is in the range [0, 0x7F]:
        emit the codepoint cast to a char
    else:
        emit "\u" followed by the hexadecimal digits representing the codepoint
Now, to get the codepoint, there is a very simple rule... each element in the UTF-16 string is a codepoint, unless it is part of a "surrogate pair", in which case it and the element after it comprise a single codepoint. In that case, the unicode standard defines a procedure for combining the "leading surrogate" and the "trailing surrogate" into a single code point. Note that UTF-8 and UTF-16 are both variable-length encodings... a code point requires 32 bits if not represented with variable length. The Unicode Transformation Format (UTF) FAQ explains the encoding as well as how to identify surrogate pairs and how to combine them into codepoints.