How to use Unicode in C++?

Question

Assuming a very simple program that:


  • Asks for a name.


That much is simple.

But my problem is that I don't know how to do the same thing if I enter the name using japanese characters.

So, if you know how to do this in C++, please show me an example (that I can compile and test)

Thanks.

user362981: Thanks for your help. I compiled the code that you wrote without problems, but when the console window appears I cannot enter any Japanese characters in it (using an IME). Also, if I change a word in your code ("hello") to one that contains Japanese characters, it will not display those either.

Svisstack : Also thanks for your help. But when I compile your code I get the following error:

warning: deprecated conversion from string constant to 'wchar_t*'
error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
error: at this point in file
warning: deprecated conversion from string constant to 'wchar_t*'


Recommended answer

You're going to get a lot of answers about wide characters. Wide characters, specifically wchar_t do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:


With the wchar_t wide character type, ANSI/ISO C provides for inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.


The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.

So, it's implementation-defined. Here are two implementations: on Linux, wchar_t is 4 bytes wide and represents text in the UTF-32 encoding (regardless of the current locale), in whichever byte order, BE or LE, is native to your system. Windows, however, has a 2-byte-wide wchar_t and uses it to represent UTF-16 code units. Completely different.

A better path: learn about locales, as you'll need to know about them. For example, because I have my environment set up to use UTF-8 (Unicode), the following program will use Unicode:

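#include <clocale>   // std::setlocale
#include <iostream>
#include <string>    // std::string, std::getline

int main()
{
    // Adopt the locale configured in the environment (e.g. LANG=en_US.UTF-8).
    std::setlocale(LC_ALL, "");
    std::cout << "What's your name? ";
    std::string name;
    // Reads raw bytes; their encoding is whatever the terminal sends.
    std::getline(std::cin, name);
    std::cout << "Hello there, " << name << "." << std::endl;
    return 0;
}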

...

$ ./uni_test
What's your name? 佐藤 幹夫
Hello there, 佐藤 幹夫.
$ echo $LANG
en_US.UTF-8

But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2": Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter, the program will still perform correctly.

Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)

On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (0xE8) to my terminal, I'll see a "č" (0xE8). Likewise, if you output a UTF-8 "è" (0xC3 0xA8), I'll see (ISO-8859-2) "Ă¨" (0xC3 0xA8). This barfing of incorrect characters has been called Mojibake.

Often, you're just shuffling data around, and it doesn't matter much. This typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)

Sadly, this is about the state of Unicode support, in both C and C++. You have to remember: these languages are really system-agnostic, and don't bind to any particular way of doing it. That includes character-sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.

In the end, it's not all that complicated really: Know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin or std::cin and stdout/std::cout were sometimes in UTF-8, sometimes ISO-8859-2.
