Size of UTF-8 string in bytes

Question

I am using QString to store strings, and now I need to store these strings (converted to UTF-8) in POD structures that look like this:

template < int N >
struct StringWrapper
{
  char theString[N];
};

To copy the raw data out of the QString, I do this:

QString str1( "abc" );
StringWrapper< 20 > str2;
strcpy( str2.theString, str1.toUtf8().constData() );
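
Note that strcpy does not check the destination size, and the UTF-8 form of a string can be longer in bytes than the QString is in characters. A minimal sketch of a bounded copy follows, in plain C++ without Qt. copyUtf8 is a hypothetical helper name, and the input is assumed to be valid UTF-8 already held in a std::string (for example, built from str1.toUtf8().constData()):

```cpp
#include <cstddef>
#include <cstring>
#include <string>

template <int N>
struct StringWrapper
{
    char theString[N];
};

// Hypothetical helper: copy UTF-8 text into the fixed-size buffer,
// always NUL-terminating and never splitting a multi-byte character.
// Returns the number of bytes copied, not counting the NUL.
template <int N>
std::size_t copyUtf8(StringWrapper<N>& dst, const std::string& utf8)
{
    static_assert(N > 0, "buffer needs room for the terminating NUL");
    const std::size_t capacity = static_cast<std::size_t>(N) - 1;
    std::size_t len = utf8.size();
    if (len > capacity) {
        len = capacity;
        // utf8[len] is the first byte that will be dropped. If it is a
        // continuation byte (10xxxxxx), the cut is inside a character,
        // so back up to the start of that character.
        while (len > 0 &&
               (static_cast<unsigned char>(utf8[len]) & 0xC0) == 0x80) {
            --len;
        }
    }
    std::memcpy(dst.theString, utf8.data(), len);
    dst.theString[len] = '\0';
    return len;
}
```

With a StringWrapper< 4 >, copying the 5-byte string "abüc" keeps only "ab": the third and fourth bytes form ü, and only one of them would fit before the terminating NUL.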

Now the question. I noticed that if I convert a plain ASCII string, it works fine:

QString str( "abc" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;

This produces the following output:

abc

But if I use some special characters, for example:

QString str( "Schöne Grüße" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;

I get garbage like this:

Gr\xC3\x83\xC2\xBC\xC3\x83\xC2\x9F

I am obviously missing something, but what exactly is wrong?

Additional question

What is the maximum size of a UTF-8 encoded character? I read here that it is 4 bytes.

Recommended answer

The first question to answer is: what is the encoding of your source files? In Qt 4, the QString( const char * ) constructor assumes Latin-1 unless you change that with QTextCodec::setCodecForCStrings(). (From Qt 5 on, this constructor interprets the literal as UTF-8 instead.) So if your sources are in anything other than Latin-1 (say, UTF-8), you get a wrong result at this point:

QString str( "Schöne Grüße" );
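
The garbage in the question is consistent with this kind of double encoding: the UTF-8 bytes of the literal are read as individual Latin-1 characters, and toUtf8() then encodes each of them as UTF-8 a second time. A minimal sketch in plain C++ that reproduces the effect without Qt (latin1ToUtf8 is a hypothetical helper name, not a Qt function):

```cpp
#include <string>

// Re-encode a byte string as UTF-8, treating every input byte as one
// Latin-1 (ISO 8859-1) character. Latin-1 values map directly to the
// Unicode code points U+0000..U+00FF, so each byte needs 1 or 2 bytes.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII: copied as-is
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte (110xxxxx)
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation (10xxxxxx)
        }
    }
    return out;
}
```

Feeding it the two UTF-8 bytes of ü (0xC3 0xBC) yields 0xC3 0x83 0xC2 0xBC, and the bytes of ß (0xC3 0x9F) yield 0xC3 0x83 0xC2 0x9F. These are the same sequences that appear in the output quoted in the question.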

Now, if your sources are in UTF-8, you need to replace it with:

QString str = QString::fromUtf8( "Schöne Grüße" );

Or, better yet, use QObject::trUtf8() wherever possible, as it gives you i18n capabilities as a free bonus.

The next thing to check is the encoding of your console. You are trying to print a UTF-8 string to it, but does it support UTF-8? If it's a Windows console, it probably doesn't. If it's something xterm-compatible using a Unicode font on a *nix system with a *.UTF-8 locale, it should be fine.
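
To take the console out of the picture while debugging, it can help to print the bytes as hexadecimal instead of as text. A minimal sketch in plain C++ (toHex is a hypothetical helper name):

```cpp
#include <cstdio>
#include <string>

// Format each byte as two uppercase hex digits, separated by spaces,
// so the result is pure ASCII and prints the same on any console.
std::string toHex(const std::string& bytes)
{
    std::string out;
    char buf[3];  // two hex digits plus the terminating NUL
    for (unsigned char c : bytes) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For a correctly decoded string, toHex( str.toUtf8().constData() ) should show ü as C3 BC and ß as C3 9F. If C3 83 C2 BC appears instead, the string was already damaged before it reached the console. Qt's own QByteArray::toHex() serves the same purpose.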

Regarding your edited question:

I don't see any reason not to trust Wikipedia, especially when it refers to a particular standard. It also mentions that UTF-8 used to allow characters of up to 6 bytes, though. In my experience, 3 bytes is the most you get with common native-language characters such as Latin, Cyrillic, Hebrew, Chinese or Japanese. 4 bytes are needed only for code points above U+FFFF, which are much more exotic; you can always check the standards if you are really curious.
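
The byte counts above follow directly from the code point ranges defined for UTF-8 in RFC 3629. A minimal sketch in plain C++ (utf8Length and maxUtf8Bytes are hypothetical helper names):

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes UTF-8 needs for one Unicode code point.
// Returns 0 for values that cannot be encoded: UTF-16 surrogates
// (U+D800..U+DFFF) and anything above U+10FFFF.
std::size_t utf8Length(std::uint32_t cp)
{
    if (cp < 0x80)                    return 1;  // ASCII
    if (cp < 0x800)                   return 2;  // e.g. Latin-1 supplement, Cyrillic, Hebrew
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates are not characters
    if (cp < 0x10000)                 return 3;  // rest of the Basic Multilingual Plane
    if (cp <= 0x10FFFF)               return 4;  // supplementary planes
    return 0;
}

// Worst-case buffer size, in bytes, for n code points plus a terminating NUL.
constexpr std::size_t maxUtf8Bytes(std::size_t n)
{
    return 4 * n + 1;
}
```

For example, utf8Length( 0xFC ) (ü) is 2, utf8Length( 0x4E2D ) (中) is 3, and utf8Length( 0x1F600 ) (an emoji) is 4. When sizing a buffer from QString::size(), which counts UTF-16 code units, 3 bytes per unit plus the NUL is already enough, because a 4-byte character occupies two UTF-16 code units.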
