Size of UTF-8 string in bytes
Question
I am using QString to store strings, and now I need to store these strings (converted to UTF-8) in POD structures that look like this:
template < int N >
struct StringWrapper
{
    char theString[N];
};
To copy the raw data out of the QString, I do this:
QString str1( "abc" );
StringWrapper< 20 > str2;
strcpy( str2.theString, str1.toUtf8().constData() );
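Note that strcpy does no bounds checking, so a UTF-8 string longer than N - 1 bytes would overflow theString. A minimal sketch of a bounded copy follows. The helper name copyUtf8 is hypothetical, and std::string stands in for the QByteArray returned by toUtf8() so the example needs nothing beyond the standard library. When it has to truncate, it backs up to a character boundary so that no multi-byte sequence is cut in half.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

template <int N>
struct StringWrapper
{
    char theString[N];
};

// Hypothetical helper: copies UTF-8 bytes into the fixed buffer and
// always null-terminates. Returns the number of bytes copied, not
// counting the terminator.
template <int N>
std::size_t copyUtf8(StringWrapper<N>& dst, const std::string& utf8)
{
    static_assert(N > 0, "buffer must at least hold the terminator");
    std::size_t len = utf8.size();
    if (len > static_cast<std::size_t>(N - 1)) {
        len = static_cast<std::size_t>(N - 1);
        // Continuation bytes have the form 10xxxxxx. If the first byte
        // we would drop is one of them, the cut is inside a character,
        // so back up until the first dropped byte starts a character.
        while (len > 0 &&
               (static_cast<unsigned char>(utf8[len]) & 0xC0) == 0x80) {
            --len;
        }
    }
    assert(len < static_cast<std::size_t>(N)); // room for '\0' remains
    std::memcpy(dst.theString, utf8.data(), len);
    dst.theString[len] = '\0';
    return len;
}
```

With Qt, the call would presumably look like copyUtf8( str2, std::string( str1.toUtf8().constData() ) ). Silent truncation is a design choice here; returning an error when the string does not fit may suit some code better.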
Now the question. I noticed that if I convert a plain ASCII string, it works fine:
QString str( "abc" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
which produces the following output:
abc
But if I use some special characters, for example:
QString str( "Schöne Grüße" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
I get garbage like this:
Gr\xC3\x83\xC2\xBC\xC3\x83\xC2\x9F
I am obviously missing something, but what exactly is wrong?
Additional question
What is the maximum size of a UTF-8 encoded character? I read here that it is 4 bytes.
Recommended answer
The first question you need to answer is: what is the encoding of your source files? In Qt 4, the QString constructor that takes a const char* assumes Latin-1 unless you change that with QTextCodec::setCodecForCStrings(). So if your sources are in anything other than Latin-1 (say, UTF-8), you get a wrong result at this point:
QString str( "Schöne Grüße" );
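To see where the garbage comes from, here is a minimal sketch of what happens when UTF-8 bytes are misread as Latin-1 and then encoded as UTF-8 again. The function names are hypothetical and plain std::string is used instead of Qt types, so this only imitates the effect and is not Qt's actual code path.

```cpp
#include <cassert>
#include <string>

// Treats every input byte as one Latin-1 character (U+0000..U+00FF)
// and writes that character out as UTF-8.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += static_cast<char>(c);                 // ASCII: 1 byte
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation
        }
    }
    return out;
}

// "ü" is C3 BC in UTF-8. Misread as Latin-1, those are the two
// characters U+00C3 and U+00BC, which encode to C3 83 and C2 BC.
void doubleEncodingDemo()
{
    assert(latin1ToUtf8("\xC3\xBC") == "\xC3\x83\xC2\xBC");
}
```

The byte sequence C3 83 C2 BC is exactly what shows up for "ü" in the garbage output from the question, and C3 83 C2 9F is the same thing for "ß".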
Now, if your sources are in UTF-8, you need to replace it with:
QString str = QString::fromUtf8( "Schöne Grüße" );
Or, better yet, use QObject::trUtf8() wherever possible, as it gives you i18n capabilities as a free bonus.
The next thing to check is the encoding of your console. You are trying to print a UTF-8 string to it, but does it support UTF-8? If it's a Windows console, it probably doesn't. If it's something xterm-compatible using a Unicode font on a *nix system with some *.UTF-8 locale, it should be fine.
As for your edited question:
I don't see any reason not to trust Wikipedia, especially when it refers to a particular standard. It also mentions that UTF-8 used to allow characters of up to 6 bytes, though. In my experience, 3 bytes is the maximum you get with common natural-language characters such as Latin, Cyrillic, Hebrew, Chinese, or Japanese. 4 bytes are needed only for code points above U+FFFF, which tend to be more exotic (rare CJK ideographs, historic scripts, emoji); you can always check the standard if you are really curious.
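For reference, the byte counts follow directly from the code point ranges in the current UTF-8 definition (RFC 3629). A minimal sketch, with hypothetical function names:

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes UTF-8 uses for one Unicode code point.
// Returns 0 for values that cannot be encoded: UTF-16 surrogates
// and anything above U+10FFFF.
std::size_t utf8Length(char32_t cp)
{
    if (cp < 0x80) return 1;                     // ASCII
    if (cp < 0x800) return 2;                    // e.g. Latin-1, Cyrillic, Hebrew
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates
    if (cp < 0x10000) return 3;                  // rest of the BMP, most CJK
    if (cp <= 0x10FFFF) return 4;                // supplementary planes
    return 0;
}

// Worst-case buffer size for a string of `utf16Units` UTF-16 code
// units, including the terminating '\0'. A BMP character is one unit
// and at most 3 bytes; a surrogate pair is two units and 4 bytes,
// so 3 bytes per unit is always enough.
std::size_t worstCaseUtf8Bytes(std::size_t utf16Units)
{
    return 3 * utf16Units + 1;
}

void utf8LengthExamples()
{
    assert(utf8Length(U'a') == 1);
    assert(utf8Length(0x00FC) == 2);   // ü
    assert(utf8Length(0x4E2D) == 3);   // 中
    assert(utf8Length(0x1F600) == 4);  // an emoji
}
```

Since QString::size() counts UTF-16 code units, worstCaseUtf8Bytes( str.size() ) gives an upper bound for N in StringWrapper; the exact size is simply str.toUtf8().size() + 1.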