Unicode conversion issues


Problem description

Here is a beginner question on Unicode. I'm using Embarcadero C++ Builder 2009, in which the default string type was supposedly changed to use Unicode.


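  • I type all kinds of symbols in the source code editor.

  • My program takes user input using the C++ Builder String type.

  • I also add input manually by assigning values to wchar_t variables.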

There seem to be conflicts in how the symbols are interpreted. Sometimes I get a symbol with, for example, the code 0x00C7 ('Ç'), but sometimes the same symbol is encoded as 0xFFC7, for example in the source code editor. As I understand it, the former is proper Unicode and the latter is "something else". Can someone confirm this?

Where is this "something else" encoding coming from, and how do I get rid of it?

Edit, after further research: one place where the 0xFF** encoding appears is when I do something like this:

string str = ...;
wchar_t wch = (wchar_t)str[i];

The result is the same whether the string is a std::string or a VCL String. Is wchar_t not the same as Unicode?

Recommended answer

I'm guessing the problem is that char is signed in your compiler (the standard allows plain char to be either signed or unsigned; the choice is implementation-defined). As a result, whenever a char with bit 7 set (values 0x80 through 0xFF) is converted to a larger integer type, it is treated as a negative value and sign-extended to preserve that value. In other words, bit 7 is copied into bit 8, bit 9, and every higher bit of the larger type. So 0xC7 can turn into 0xFFC7 in a 16-bit type or 0xFFFFFFC7 in a 32-bit type. To prevent that, cast each char to unsigned char first, and only then to the wider type.
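To make the fix concrete, here is a minimal sketch of both conversions. The helper names (widen_naive, widen_byte, widen_string) are made up for this illustration and are not part of the VCL or the standard library. It also assumes the narrow string holds ISO-8859-1 (Latin-1) bytes, where each byte value equals its Unicode code point; for any other narrow encoding a real conversion routine is needed instead of a cast.

```cpp
#include <string>

// Hypothetical helper: the cast from the question. If plain char is signed,
// a byte such as 0xC7 holds the value -57, and converting it to a wider type
// sign-extends it (giving 0xFFC7 when wchar_t is 16 bits wide).
inline wchar_t widen_naive(char c) {
    return (wchar_t)c;
}

// Hypothetical helper: the suggested fix. Converting to unsigned char first
// maps the byte to the range 0..255, so the wider value has no extra high bits.
inline wchar_t widen_byte(char c) {
    return static_cast<wchar_t>(static_cast<unsigned char>(c));
}

// Hypothetical helper: widen a whole narrow string byte by byte.
// Assumption: the input is Latin-1, so byte value == code point.
inline std::wstring widen_string(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (char c : narrow) {
        wide.push_back(widen_byte(c));
    }
    return wide;
}
```

With this in place, widen_byte('\xC7') yields 0x00C7 regardless of whether char is signed, whereas the result of widen_naive('\xC7') depends on the signedness of char and the width of wchar_t on the platform.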
