C++ unicode UTF-16 encoding
Question
I have a wide char string, L"hao123--我的上网主页", and it must be encoded to "hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". I was told that the encoded string uses a special "%uNNNN" format for encoding Unicode UTF-16 code points. This website tells me it's JavaScript escapes. But I don't know how to produce that encoding in C++.
Is there any library to get this to work? Or can you give me some tips?
Thanks, my friends!
Embedding unicode in string literals is generally not a good idea and is not portable; there is no guarantee that wchar_t will be 16 bits and that the encoding will be UTF-16. While this may be the case on Windows with Microsoft Visual C++ (a particular C++ implementation), wchar_t is 32 bits on OS X's GCC (another implementation). If you have some sort of localized string constants, it's best to keep them in a configuration file in some particular encoding and to interpret them as that encoding. The International Components for Unicode (ICU) library provides pretty good support for interpreting and handling unicode. Another good library for converting between (but not interpreting) encoding formats is libiconv.
Edit
It is possible I am misinterpreting your question... if the problem is that you have a string in UTF-16 already, and you want to convert it to "unicode-escape ASCII" (i.e. an ASCII string where unicode characters are represented by "\u" followed by the numeric value of the character), then use the following pseudo-code:
for each codepoint represented by the UTF-16 encoded string:
    if the codepoint is in the range [0, 0x7F]:
        emit the codepoint cast to a char
    else:
        emit "\u" followed by the hexadecimal digits representing the codepoint
Now, to get the codepoint, there is a very simple rule... each element in the UTF-16 string is a codepoint, unless it is part of a "surrogate pair", in which case it and the element after it comprise a single codepoint. In that case, the unicode standard defines a procedure for combining the "leading surrogate" and the "trailing surrogate" into a single code point. Note that UTF-8 and UTF-16 are both variable-length encodings... a code point requires 32 bits if not represented with variable length. The Unicode Transformation Format (UTF) FAQ explains the encoding as well as how to identify surrogate pairs and how to combine them into codepoints.