Unicode string indexing in C++

Question

I come from Python, where you can use 'string[10]' to access a character in a sequence. If the string is encoded in Unicode, it gives me the expected result. However, when I index a string in C++, it works as long as the characters are ASCII, but when the string contains a Unicode character and I index it, the output is an octal representation like /201. For example:

string ramp = "ÐðŁłŠšÝýÞþŽž";
cout << ramp << "\n";    
cout << ramp[5] << "\n";

Output:

ÐðŁłŠšÝýÞþŽž
/201

Why is this happening, and how can I access that character in its string representation, or convert the octal representation to the actual character?

Recommended answer

Standard C++ is not equipped for proper handling of Unicode, giving you problems like the one you observed.
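To make the observation concrete: a std::string is a sequence of bytes, and indexing it returns one byte, not one character. The sketch below assumes the string ended up UTF-8 encoded (the usual case on Linux) and spells the bytes out explicitly so the result does not depend on the source file encoding. The names ramp and byte_value are only illustrative.

```cpp
#include <cstddef>
#include <string>

// The asker's string written as explicit UTF-8 bytes (assumption: UTF-8 is
// the encoding the original program ended up with). Two bytes per letter.
const std::string ramp =
    "\xC3\x90\xC3\xB0\xC5\x81\xC5\x82\xC5\xA0\xC5\xA1"   // Ð ð Ł ł Š š
    "\xC3\x9D\xC3\xBD\xC3\x9E\xC3\xBE\xC5\xBD\xC5\xBE";  // Ý ý Þ þ Ž ž

// Hypothetical helper: numeric value of the byte stored at index i.
unsigned byte_value(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}
```

Under that assumption the twelve letters occupy 24 bytes, and byte_value(ramp, 5) is 0x81, which is 201 in octal: the second byte of 'Ł', not the sixth letter 'š'. A terminal cannot display that lone byte as a character, which would explain the escaped output.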

The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours will be interpreted in an implementation-defined manner because those characters are not defined in the Basic Source Character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).

C++98 does not mention Unicode at all. It mentions wchar_t, and wstring being based on it, specifying wchar_t as being capable of "representing any character in the current locale". But that did more harm than good...

Microsoft defined wchar_t as 16 bit, which was enough for the Unicode code points at that time. However, since then Unicode has been extended beyond the 16-bit range... and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP -- and the Microsoft docs are notoriously ambiguous as to whether wchar_t means UTF-16 (multibyte encoding with surrogate pairs) or UCS-2 (wide encoding with no support for characters beyond the BMP).

All the while, a Linux wchar_t is 32 bit, which is wide enough for UTF-32...

C++11 improved matters significantly, adding char16_t and char32_t together with their associated string variants to remove the ambiguity, but it is still not fully equipped for Unicode operations.
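As a rough illustration of what those C++11 types buy you, a std::u32string stores one code point per element, so indexing behaves much more like the Python indexing in the question. This is a minimal sketch; ramp32 and code_point_at are names made up for the example.

```cpp
#include <cstddef>
#include <string>

// The question's text as UTF-32. The \u escapes keep the source pure ASCII,
// so the result does not depend on the source file encoding.
const std::u32string ramp32 =
    U"\u00D0\u00F0\u0141\u0142\u0160\u0161\u00DD\u00FD\u00DE\u00FE\u017D\u017E";

// Hypothetical helper: code point at a given index, bounds-checked via at().
char32_t code_point_at(const std::u32string& s, std::size_t i) {
    return s.at(i);
}
```

Here ramp32.size() is 12 and code_point_at(ramp32, 5) is 0x161, the code point of 'š'. Note that a code point is still not always what a user perceives as one character (combining marks are separate code points), and printing a char32_t through the standard streams remains a separate problem.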

Just as one example, try to convert e.g. German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions -- handling one character in, one character out at a time -- cannot do.)
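The limitation can be sketched with the standard library alone. The hypothetical helper below uppercases a wide string with std::towupper, which maps exactly one character to one character, so the result can never be longer than the input.

```cpp
#include <cwchar>
#include <cwctype>
#include <string>

// Hypothetical helper: uppercase a wide string using the standard
// one-in, one-out mapping std::towupper.
std::wstring upper_per_char(std::wstring s) {
    for (wchar_t& c : s) {
        c = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(c)));
    }
    return s;
}
```

Calling upper_per_char(L"Fu\u00DF") always returns three characters, whatever the locale does with the last one, so the four-letter "FUSS" is out of reach for this approach.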

However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use u8"", u"", and U"" to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, using octal / hexadecimal escapes or relying on your compiler implementation to handle non-ASCII-7 encodings appropriately.
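A small sketch of the three prefixes, using escapes instead of raw non-ASCII characters. The helper code_units is an illustrative name; it counts the code units in a literal, excluding the terminator, and works whether u8 literals are arrays of char (C++11 to C++17) or char8_t (C++20).

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+0161 (the letter 'š') in each encoding.
constexpr std::size_t s_utf8  = code_units(u8"\u0161");  // two bytes
constexpr std::size_t s_utf16 = code_units(u"\u0161");   // one 16-bit unit
constexpr std::size_t s_utf32 = code_units(U"\u0161");   // one 32-bit unit

// U+1F600 lies beyond the BMP.
constexpr std::size_t e_utf8  = code_units(u8"\U0001F600");  // four bytes
constexpr std::size_t e_utf16 = code_units(u"\U0001F600");   // surrogate pair
constexpr std::size_t e_utf32 = code_units(U"\U0001F600");   // one unit
```

Only the UTF-32 form keeps one element per code point in both cases, which is why the same index can land on different things depending on the prefix.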

And even then you will get an integer value for std::cout << ramp[5], because for C++, a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short would suddenly be interpreted as characters. You need the C-API u_fputs() / u_printf() / u_fprintf() functions for that.

#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <unicode/ustdio.h>

#include <iostream>

int main()
{
    // make sure your source file is UTF-8 encoded...
    icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
    std::cout << ramp << "\n";
    std::cout << ramp[5] << "\n";
    u_printf( "%C\n", ramp[5] );
}

Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc.

ÐðŁłŠšÝýÞþŽž
353
š


(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a Unicode string.
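To show what "one half of a surrogate pair" means, here is a standard-library analog using std::u16string rather than ICU. The helper code_point_from is hypothetical and is not ICU's API; ICU provides its own code-point-aware accessors, such as UnicodeString::char32At().

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode the code point that starts at code unit
// index i. If the unit is not the lead half of a valid surrogate pair,
// the unit itself is returned unchanged.
char32_t code_point_from(const std::u16string& s, std::size_t i) {
    const char16_t lead = s.at(i);
    const bool is_lead = lead >= 0xD800 && lead <= 0xDBFF;
    if (is_lead && i + 1 < s.size()) {
        const char16_t trail = s[i + 1];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            // Combine the two 10-bit halves and add the 0x10000 offset.
            return 0x10000
                 + ((static_cast<char32_t>(lead) - 0xD800) << 10)
                 + (static_cast<char32_t>(trail) - 0xDC00);
        }
    }
    return lead;
}
```

For the string u"a\U0001F600", plain indexing at position 1 yields 0xD83D, which is only the lead surrogate, while code_point_from at position 1 yields the full code point 0x1F600.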
