C++ Unicode UTF-16 encoding


Problem description



I have a wide char string, L"hao123--我的上网主页", and it must be encoded as "hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". I was told that the encoded string uses a special "%uNNNN" format for encoding Unicode UTF-16 code points. This website tells me those are JavaScript escapes, but I don't know how to produce them in C++.

Is there any library to get this to work? Or can you give me some tips?

Thanks my friends!

Solution

Embedding Unicode in string literals is generally not a good idea and is not portable; there is no guarantee that wchar_t will be 16 bits or that the encoding will be UTF-16. While that may be the case on Windows with Microsoft Visual C++ (a particular C++ implementation), wchar_t is 32 bits on OS X's GCC (another implementation). If you have some sort of localized string constants, it is better to keep them in a configuration file in a specific encoding and interpret them as that encoding when you read them. The International Components for Unicode (ICU) library provides pretty good support for interpreting and handling Unicode. Another good library for converting between (but not interpreting) encoding formats is libiconv.
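To make the libiconv suggestion concrete, here is a minimal sketch (not from the original answer) that converts a std::wstring, whatever width wchar_t has on your platform, into UTF-16LE bytes. The helper name wide_to_utf16le is mine, and the source-encoding name "WCHAR_T" is a GNU libiconv/glibc extension; other iconv implementations may want "UTF-32LE" or "UTF-16LE" there instead.

#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

std::string wide_to_utf16le(const std::wstring& in) {
    // "WCHAR_T" asks iconv to treat the source as the platform's own
    // wchar_t encoding; this name is a GNU extension.
    iconv_t cd = iconv_open("UTF-16LE", "WCHAR_T");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::size_t in_bytes  = in.size() * sizeof(wchar_t);
    std::string out(in.size() * 4, '\0');      // 4 output bytes per element is enough
    std::size_t out_bytes = out.size();

    char* in_ptr  = reinterpret_cast<char*>(const_cast<wchar_t*>(in.data()));
    char* out_ptr = &out[0];

    std::size_t rc = iconv(cd, &in_ptr, &in_bytes, &out_ptr, &out_bytes);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - out_bytes);        // keep only the bytes actually written
    return out;
}

ICU can do the same job: its icu::UnicodeString type stores text as UTF-16 internally, so constructing one from your data already gives you UTF-16 code units.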

Edit
It is possible I am misinterpreting your question... if the problem is that you have a string in UTF-16 already, and you want to convert it to "unicode-escape ASCII" (i.e. an ASCII string where non-ASCII characters are represented by "\u" followed by the character's hexadecimal value), then use the following pseudo-code:

for each codepoint represented by the UTF-16 encoded string:
    if the codepoint is in the range [0, 0x7F]:
        emit the codepoint cast to a char
    else:
        emit "\u" followed by the hexadecimal digits representing the codepoint

Now, to get the codepoint, there is a very simple rule... each element in the UTF-16 string is a codepoint, unless it is part of a "surrogate pair", in which case it and the element after it comprise a single codepoint. If so, the Unicode standard defines a procedure for combining the "leading surrogate" and the "trailing surrogate" into a single code point. Note that UTF-8 and UTF-16 are both variable-length encodings... a code point requires 32 bits if it is not represented in a variable-length form. The Unicode Transformation Format (UTF) FAQ explains the encodings as well as how to identify surrogate pairs and how to combine them into codepoints.
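
As a sketch of that combining step (the helper names are mine; the constants are the ones given in the Unicode FAQ): a leading surrogate lies in [0xD800, 0xDBFF], a trailing surrogate in [0xDC00, 0xDFFF], and the pair encodes a single code point above U+FFFF.

#include <cstdint>

inline bool is_lead_surrogate(char16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_trail_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a lead/trail surrogate pair into the code point it encodes.
inline std::uint32_t combine_surrogates(char16_t lead, char16_t trail) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(lead)  - 0xD800u) << 10)
         +  (static_cast<std::uint32_t>(trail) - 0xDC00u);
}

In the escaping loop above, when is_lead_surrogate(unit) is true you would read the next element and combine the two before deciding how to emit the code point. Note that the JavaScript-style escape in the question actually writes the two surrogates as separate \uXXXX escapes, so for that exact output format you can also simply escape every UTF-16 unit as-is.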
