Multi-Byte UTF-8 in Arrays in C++
Question
I have been having trouble working with 3-byte Unicode UTF-8 characters in arrays. When they are in char arrays I get multi-character character constant and implicit constant conversion warnings, but when I use wchar_t arrays, wcout returns nothing at all. Because of the nature of the project, it must be an array and not a string. Below is an example of what I've been trying to do.
#include <iostream>
#include <string>
using namespace std;

int main()
{
    wchar_t testing[40];
    testing[0] = L'\u0B95';
    testing[1] = L'\u0BA3';
    testing[2] = L'\u0B82';
    testing[3] = L'\0';
    wcout << testing[0] << endl;
    return 0;
}
Any suggestions? I'm working with OSX.
Recommended answer
Since '\u0B95' requires 3 bytes, it is considered a multicharacter literal. A multicharacter literal has type int and an implementation-defined value. (Actually, I don't think gcc is correct to do this.)
Putting the L prefix before the literal gives it type wchar_t and an implementation-defined value (it maps to a value in the execution wide-character set, which is an implementation-defined superset of the basic execution wide-character set).
The C++11 standard provides some more Unicode-aware types and literals. The additional types are char16_t and char32_t, whose values are the Unicode code points that represent the character. They correspond to UTF-16 and UTF-32 respectively.
Since you need character literals to store characters from the Basic Multilingual Plane, you'll need a char16_t literal. This can be written as, for example, u'\u0B95'. You can therefore write your code as follows, with no warnings or errors:
char16_t testing[40];
testing[0] = u'\u0B95';
testing[1] = u'\u0BA3';
testing[2] = u'\u0B82';
testing[3] = u'\0';
Unfortunately, the I/O library does not play nicely with these new types.
If you do not truly require using character literals as above, you may make use of the new UTF-8 string literals:
const char* testing = u8"\u0B95\u0BA3\u0B82";
This will encode the characters as UTF-8.