C++ Visual Studio character encoding issues


Problem description

Not being able to wrap my head around this one is a real source of shame...

I'm working with a French version of Visual Studio (2008), in a French Windows (XP). French accents put in strings sent to the output window get corrupted. Ditto input from the output window. Typical character encoding issue, I enter ANSI, get UTF-8 in return, or something to that effect. What setting can ensure that the characters remain in ANSI when showing a "hardcoded" string to the output window?

EDIT:

Example:

#include <iostream>

int main()
{
    std::cout << "àéêù" << std::endl;

    return 0;
}

Will show in the output:

óúÛ¨

(here encoded as HTML for your viewing pleasure)

I would really like it to show:

àéêù

Solution

Before I go any further, I should mention that what you are doing is not C/C++ compliant. The specification states in 2.2 which character sets are valid in source code. There is not much in there, and all the characters it allows are in ASCII. So... everything below is about a specific implementation (as it happens, VC2008 on a US-locale machine).

To start with, you have 4 chars on your cout line, and 4 glyphs on the output. So the issue is not one of UTF-8 encoding, as that would combine multiple source chars into fewer glyphs.
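
One quick way to verify that (a small sketch of my own, not part of the original answer, reusing the literal from the question) is to count the bytes the compiler actually emitted for the literal: a single-byte codepage such as CP1252 stores the four accented characters in 4 bytes, whereas a UTF-8 execution charset would need 8.

#include <cstring>
#include <iostream>

int main()
{
    const char* s = "àéêù";
    // 4 bytes -> a single-byte codepage (e.g. CP1252); 8 bytes -> UTF-8
    std::cout << "bytes in literal: " << std::strlen(s) << std::endl;
    return 0;
}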

From your source string to the display on the console, all of these things play a part:

  1. What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
  2. What your compiler does with a string literal, and what source encoding it understands
  3. How your << interprets the encoded string you're passing in
  4. What encoding the console expects
  5. How the console translates that output to a font glyph.

Now...

1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in and decodes it to its internal representation. No matter what the source encoding was, it generates the data chunk for the string literal in the current codepage. I have failed to find explicit details/control over this.

3 is even easier. Except for control codes, << just passes the data through for a char *.

4 is controlled by SetConsoleOutputCP. It should default to your system's default codepage. You can also figure out which one you have with GetConsoleOutputCP (input is controlled differently, through SetConsoleCP).
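
As a minimal sketch of that check-and-set dance (my own example, assuming a plain Win32 console program; 1252 is used purely as an example codepage for Western European Windows):

#include <windows.h>
#include <iostream>

int main()
{
    // Ask the console which codepage it currently uses for output.
    std::cout << "current output codepage: " << GetConsoleOutputCP() << std::endl;

    // Tell the console to interpret the bytes we write as CP1252.
    if (!SetConsoleOutputCP(1252))
        std::cerr << "SetConsoleOutputCP failed" << std::endl;

    // "àéêù" spelled out explicitly as CP1252 bytes, so the source encoding
    // cannot interfere with the experiment.
    std::cout << "\xE0\xE9\xEA\xF9" << std::endl;
    return 0;
}

If the accented characters still come out wrong after that, the remaining suspect is point 5: the console font.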

5 is a funny one. I banged my head trying to figure out why I could not get the é to show up properly using CP1252 (Western European, Windows). It turns out that my system font does not have a glyph for that character, and it helpfully substitutes the glyph from my standard codepage (a capital Theta, the same one I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a TrueType font).

Some interesting things I learned looking at this:

  • the encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF-8 did not change the generated code; my "é" string was still encoded with CP1252 as 233 0)
  • VC is picking a codepage for the string literals that I do not seem to control.
  • controlling what the console shows is more painful than I was expecting.

So... what does this mean for you? Here are a few bits of advice:

  • don't use non-ASCII in string literals. Use resources, where you control the encoding.
  • make sure you know what encoding your console expects, and that your font has the glyphs to represent the chars you send.
  • if you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. char * a = "é"; std::cout << (unsigned int) (unsigned char) a[0] does show 233 for me, which happens to be the encoding in CP1252 (a slightly expanded version of that check is sketched after this list).
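
Expanding that last bullet into a complete program (again a sketch of my own, assuming the same literal as in the question):

#include <iostream>

int main()
{
    const char* a = "àéêù";
    // Print each byte of the literal as an integer; with CP1252 as the
    // execution charset this prints 224 233 234 249.
    for (const char* p = a; *p != '\0'; ++p)
        std::cout << static_cast<unsigned int>(static_cast<unsigned char>(*p)) << ' ';
    std::cout << std::endl;
    return 0;
}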

BTW, if what you got was "ÓÚÛ¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.
