VC++ compiler /source-charset:utf-8 doesn't work
Question
While experimenting with code units under UTF-8 in Visual Studio, I ran into several pitfalls:
By default, VS saves source files in an encoding tied to the system locale; for me that is GB2312 (code page 936, a Chinese encoding).
Solution: I used Save As and saved the file as UTF-8 without signature.
Then I found that, by default, the compiler also interprets the source file using the system-locale encoding, which is still GB2312, so I got puzzling warnings and syntax errors.
Solution: I compiled with /source-charset:utf-8, which produced no warnings or errors. But the reported size is 2 ('知' is encoded with 2 code units in GB2312), whereas it should be 3 under UTF-8.
'知' Unicode reference: https://unicode-table.com/en/77E5/
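(I think you can run a similar test with any character that exists in both your current system encoding and UTF-8 but takes a different number of code units in each.)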
Code:
#include <iostream>
#include <string>
using namespace std;
int main(){
string s = "知";
cout << s.size() <<endl;
cout << s << endl;
}
Moreover, the Windows cmd and PowerShell consoles also use the system-locale encoding (type chcp in cmd to check), so I can't print characters like ə.
So there are three things I need to take care of:
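- The encoding of the source file
- Whether the compiler interprets the source file as expected
- Even when the first two are satisfied, cmd may still fail to display the characters.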
Besides, this experience left me with some questions:
Why does Windows behave like this? Can't it just use UTF-8 for everything? I copied the same file to a Mac and everything worked as expected. It is also very easy to set the terminal encoding on a Mac.
Some posts I found say the reason is that some encoding standards (such as GB2312) were created before UTF-8 came out, and many of them are not compatible with UTF-8, so they remain in use for compatibility.
But I wonder how this incompatibility shows up in practice. For example, I downloaded Notepad++ and installed all the language packs. My system's encoding is GB2312, yet I can still switch Notepad++'s display language to Japanese and it displays correctly, not as something like ????.
Recommended answer
The term "source charset" is no coincidence here. The C++ standard explicitly differentiates between the (basic) source character set (96 common characters, all found in plain ASCII) and the execution character set.
Since you used UTF-8 as the source character set, 知 is mapped to \u77E5.
At runtime, however, you're using the execution character set. The VC++ /source-charset option does not affect VC++'s execution character set; for that there is an /execution-charset option. With the execution character set left at the system default (code page 936 here), the literal is stored as GB2312, which is why size() reports 2.
But as @Matteo Italia already notes, the VC++ runtime is known to be more than a little flaky when it comes to UTF-8 I/O. std::string::size should work, but std::cout might not.