Size of UTF-8 string in bytes

Published: 2021/9/15 19:45:05. Tags: c++, qt, utf-8

Question

I am using QString to store strings, and now I need to store these strings (converted to UTF-8 encoding) in POD structures that look like this:

template < int N >
struct StringWrapper
{
  char theString[N];
};

To copy the raw data out of the QString, I do this:

QString str1( "abc" );
StringWrapper< 20 > str2;
strcpy( str2.theString, str1.toUtf8().constData() );

Now the question. I noticed that if I convert from a plain ASCII string, it works fine:

QString str( "abc" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;

This produces the output:

abc

But if I use some special characters, for example:

QString str( "Schöne Grüße" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;

I get garbage like this:

Gr\xC3\x83\xC2\xBC\xC3\x83\xC2\x9F

I am obviously missing something, but what exactly is wrong?

Additional question

What is the maximum size of a UTF-8 encoded character? I read here that it is 4 bytes.

Recommended answer

The first question you need to answer is: what is the encoding of your source files? In Qt 4, the QString constructor that takes a const char * assumes Latin-1 unless you change it with QTextCodec::setCodecForCStrings(). So if your sources are in anything other than Latin-1 (say, UTF-8), you get a wrong result at this point:

QString str( "Schöne Grüße" );
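To see where the garbage in the question comes from, here is a minimal sketch in plain C++ (no Qt) of what that misinterpretation amounts to: each byte of the UTF-8 source literal is taken as one Latin-1 character and then encoded to UTF-8 again. The function name latin1ToUtf8 is a hypothetical helper for illustration, not a Qt API.

```cpp
#include <string>

// Hypothetical helper: re-encode a byte string as UTF-8, treating every
// input byte as one Latin-1 character (U+0000..U+00FF).
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII stays one byte
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte: 0xC2 or 0xC3
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of "ü" (C3 BC) yields C3 83 C2 BC, and "ß" (C3 9F) yields C3 83 C2 9F, which are exactly the byte sequences shown in the question's output. Pure ASCII input such as "abc" passes through unchanged, which is why the first test looked fine.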

Now, if your sources are in UTF-8, you need to replace it with:

QString str = QString::fromUtf8( "Schöne Grüße" );

Or, better yet, use QObject::trUtf8() wherever possible, as it gives you i18n capabilities as a free bonus.

The next thing to check is the encoding of your console. You are trying to print a UTF-8 string to it, but does it support UTF-8? If it's a Windows console, it probably doesn't. If it's something xterm-compatible using a Unicode font on a *nix system with some *.UTF-8 locale, it should be fine.

Regarding your edited question:

I don't see any reason not to trust Wikipedia, especially when it refers to a particular standard. It also mentions that UTF-8 used to allow characters of up to 6 bytes, though. In my experience, 3 bytes is the maximum you get with reasonable native-language characters such as Latin/Cyrillic/Hebrew/Chinese/Japanese. 4 bytes are used for code points outside the Basic Multilingual Plane, which are more exotic; you can always check the standard if you are really curious.
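The byte counts above can be made concrete with a short sketch in plain C++ (no Qt). It assumes valid UTF-8 input; utf8EncodedLength and copyUtf8Truncated are hypothetical helper names, not part of Qt or the standard library. The first returns how many bytes one code point needs, and the second is a bounded alternative to the strcpy from the question that never cuts a multi-byte sequence in half.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Number of bytes UTF-8 needs for one Unicode code point (1 to 4).
// Returns 0 for values that cannot be encoded: surrogates and
// anything above U+10FFFF.
std::size_t utf8EncodedLength(char32_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

// Copy UTF-8 bytes into a buffer of `capacity` bytes, always writing a
// terminating NUL. If the text does not fit, it is cut at a character
// boundary. Returns the number of bytes copied, excluding the NUL.
std::size_t copyUtf8Truncated(char* dest, std::size_t capacity,
                              const std::string& utf8)
{
    assert(capacity > 0);
    std::size_t n = utf8.size() < capacity ? utf8.size() : capacity - 1;
    if (n < utf8.size()) {
        // Back up while the first dropped byte is a continuation byte
        // (bit pattern 10xxxxxx), so no sequence is split.
        while (n > 0 && (static_cast<unsigned char>(utf8[n]) & 0xC0) == 0x80) {
            --n;
        }
    }
    std::memcpy(dest, utf8.data(), n);
    dest[n] = '\0';
    return n;
}
```

With Qt, the bytes would come from QString::toUtf8(), and QByteArray::size() gives the length in bytes. "Schöne Grüße" is 12 characters but 15 bytes in UTF-8, so it fits a StringWrapper< 20 > with room for the terminator. As a worst-case bound, a buffer meant to hold k characters needs 4 * k + 1 bytes.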
